EURAXESS Improved Search

A local search engine for academic job postings from EURAXESS.

This project crawls the EURAXESS jobs listing into a local SQLite database, classifies postings by role and topic, builds a hybrid search index, and serves the results through a local FastAPI app. It is designed to run on your machine, with the data and index stored under data/.

What it does

- Builds a local copy of EURAXESS job postings
- Supports incremental updates with HTTP caching headers
- Classifies jobs by role, topic domain, and language
- Provides hybrid search with SQLite FTS5 plus FAISS vectors
- Serves a local web UI and JSON API
- Exports data to JSONL or Parquet
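
To illustrate what "hybrid search" can mean in practice, here is a minimal sketch of reciprocal rank fusion, one common way to merge a keyword ranking with a vector ranking. This README does not say how the project combines FTS5 and FAISS results, so the function name, the constant `k=60`, and the sample job IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of job IDs into one fused ranking.

    Hypothetical helper: the project may combine scores differently.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, job_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[job_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by job ID for determinism.
    return sorted(scores, key=lambda job_id: (-scores[job_id], job_id))


keyword_hits = ["job-3", "job-1", "job-7"]  # made-up FTS5 result order
vector_hits = ["job-1", "job-9", "job-3"]   # made-up FAISS result order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Rank fusion only needs the order of each result list, so it sidesteps the fact that keyword relevance scores and vector distances are not on a comparable scale.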

Project layout

euraxess_scraper/
  cli.py
  config.py
  db.py
  discovery.py
  export.py
  fetch.py
  indexing.py
  language.py
  nli_classifier.py
  parse_job.py
  search.py
  taxonomy.py
  topics.py
  web/
    app.py
    templates/
  resources/
scripts/
  scheduled_update.sh
tests/
data/
  euraxess.db
  exports/
  index/

Installation

Requirements: Python 3.11 or newer.

Option A: `uv`

uv venv
source .venv/bin/activate
uv pip install -e .
uv pip install -e ".[dev]"

Option B: `pip`

python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e ".[dev]"

Optional environment override:

export DATA_DIR=/absolute/path/to/euraxess-data

By default, the app stores its database, exports, and vector index under ./data.
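
As a rough sketch of how the override could be resolved, the snippet below derives the three storage locations from `DATA_DIR`, falling back to `./data`. The function name and dictionary keys are hypothetical; the real logic lives in `euraxess_scraper/config.py` and may differ.

```python
import os
from pathlib import Path


def resolve_data_paths(env=None):
    """Return database, export, and index paths for the given environment.

    Hypothetical helper; only the DATA_DIR name and the file layout
    come from this README.
    """
    env = os.environ if env is None else env
    data_dir = Path(env.get("DATA_DIR", "data"))
    return {
        "db": data_dir / "euraxess.db",
        "exports": data_dir / "exports",
        "index": data_dir / "index",
    }


paths = resolve_data_paths({"DATA_DIR": "/absolute/path/to/euraxess-data"})
print(paths["db"])
```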

Build the local database

A full first-time crawl of the whole dataset takes a few hours at the default crawl rate.

python -m euraxess_scraper.cli crawl
python -m euraxess_scraper.cli reclassify
python -m euraxess_scraper.cli classify-topics
python -m euraxess_scraper.cli detect-language
python -m euraxess_scraper.cli build-index

If you already have a populated data/euraxess.db and data/index/, you can skip straight to serving or updating.

Run locally

Web UI

python -m euraxess_scraper.cli serve --host 127.0.0.1 --port 8000

Then open http://127.0.0.1:8000.

CLI search

python -m euraxess_scraper.cli search --query "machine learning postdoc germany"
python -m euraxess_scraper.cli search --query "quantum optics" --job-type postdoc
python -m euraxess_scraper.cli search --query "sociology phd" --topic social_sciences
python -m euraxess_scraper.cli search --query "bioinformatics" --country Germany

Manual update

Fast incremental refresh:

python -m euraxess_scraper.cli update --max-pages 30 --no-delist
python -m euraxess_scraper.cli reclassify
python -m euraxess_scraper.cli classify-topics --only-missing
python -m euraxess_scraper.cli detect-language --only-missing
python -m euraxess_scraper.cli build-index

Full refresh:

python -m euraxess_scraper.cli update
python -m euraxess_scraper.cli reclassify
python -m euraxess_scraper.cli classify-topics --only-missing
python -m euraxess_scraper.cli detect-language --only-missing
python -m euraxess_scraper.cli build-index

Or use the helper script:

./scripts/scheduled_update.sh

You can wire that script into your own cron job or another local scheduler if you want recurring refreshes.
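
Incremental updates rely on HTTP caching headers. The sketch below shows the general idea of a conditional request: send back the validators stored from the previous fetch and skip re-parsing when the server answers `304 Not Modified`. Both function names are hypothetical and are not taken from `fetch.py`.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on an earlier fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reparse(status_code):
    """A 304 response means the stored copy of the posting is still current."""
    return status_code != 304


headers = conditional_headers(
    etag='"abc123"',  # made-up validator values
    last_modified="Wed, 01 Jan 2025 00:00:00 GMT",
)
print(headers, needs_reparse(304))
```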

Useful commands

# Crawl and update
python -m euraxess_scraper.cli crawl
python -m euraxess_scraper.cli crawl --dry-run
python -m euraxess_scraper.cli update
python -m euraxess_scraper.cli update --max-pages 30 --no-delist

# Classification
python -m euraxess_scraper.cli reclassify
python -m euraxess_scraper.cli nli-classify-type
python -m euraxess_scraper.cli classify-topics --only-missing
python -m euraxess_scraper.cli detect-language --only-missing

# Index and search
python -m euraxess_scraper.cli build-index
python -m euraxess_scraper.cli search --query "..."
python -m euraxess_scraper.cli eval-search --gold tests/fixtures/gold_queries.yaml

# Serve the local app
python -m euraxess_scraper.cli serve

# Data and exports
python -m euraxess_scraper.cli stats
python -m euraxess_scraper.cli export --format jsonl --output data/exports/jobs.jsonl
python -m euraxess_scraper.cli export --format parquet --output data/exports/jobs.parquet
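
A JSONL export can be read back with the standard library alone. This is a minimal sketch that assumes one JSON object per line; it makes no assumption about which fields each record contains, and `iter_jsonl` is a hypothetical helper rather than part of the project.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


export_path = Path("data/exports/jobs.jsonl")
if export_path.exists():
    total = sum(1 for _ in iter_jsonl(export_path))
    print(f"{total} exported jobs")
```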

Web endpoints

| Endpoint | Description |
| --- | --- |
| `GET /` | Search UI |
| `GET /search` | HTML search results |
| `GET /api/search` | JSON search API |
| `GET /jobs/{job_id}` | Job detail page |
| `GET /api/jobs/{job_id}` | Job detail as JSON |

Search parameters for /search and /api/search:

| Parameter | Type | Description |
| --- | --- | --- |
| `q` | string | Free-text query |
| `job_type` | string | `all` \| `postdoc` \| `phd` \| `professor` \| `other` \| `unknown` |
| `include_topic` | string (repeatable) | Include jobs in these topic domains |
| `exclude_topic` | string (repeatable) | Exclude jobs in these topic domains |
| `country` | string | Country filter |
| `language` | string | ISO 639-1 code such as `en`, `fr`, or `de` |
| `page` | int | Page number |
| `page_size` | int | Results per page, max `100` |
| `active_only` | bool | Exclude delisted jobs |
| `open_only` | bool | Exclude past-deadline jobs |

Topic keys: computer_science, natural_sciences, engineering_technology, medical_health, agricultural_veterinary, social_sciences, humanities_arts, other
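
The parameters above can be combined into a query string as in the following sketch. It assumes that repeatable parameters are sent as repeated keys and that booleans are written as `true`/`false`; neither detail is confirmed by this README, and `build_search_url` is a hypothetical helper. The response format is not documented here, so the example stops at building the URL.

```python
from urllib.parse import urlencode


def build_search_url(base_url, q, **filters):
    """Build an /api/search URL from a query and optional filters."""
    params = [("q", q)]
    for key, value in filters.items():
        # Repeatable parameters are passed as lists and expand to repeated keys.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"  # assumed boolean encoding
            params.append((key, item))
    return f"{base_url}/api/search?{urlencode(params)}"


url = build_search_url(
    "http://127.0.0.1:8000",
    "quantum optics",
    job_type="postdoc",
    include_topic=["natural_sciences", "engineering_technology"],
    open_only=True,
    page_size=20,
)
print(url)
```

With the server from the Run locally section running, the resulting URL can be fetched with any HTTP client.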

Rate limiting

The crawler is intentionally conservative:

- 1 request per second globally
- Up to 3 concurrent workers
- Random jitter between requests
- Retry backoff for transient failures
- Automatic halt after too many consecutive failures
- robots.txt check at startup

Lower the request rate further if you are doing long unattended runs.
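
The global rate limit with jitter can be pictured with a small throttle like the one below. This is a single-threaded sketch under stated assumptions: the class name, the 0.25 second jitter bound, and the injectable `clock` and `sleep` arguments are all hypothetical. Sharing it across the crawler's concurrent workers would also need a lock.

```python
import random
import time


class Throttle:
    """Space calls at least 1 / rate_per_sec seconds apart, plus random jitter."""

    def __init__(self, rate_per_sec=1.0, jitter=0.25,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.jitter = jitter
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the delay used."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        delay += random.uniform(0.0, self.jitter)
        if delay > 0:
            self.sleep(delay)
        # Schedule the following request relative to when this one starts.
        self.next_allowed = now + delay + self.min_interval
        return delay


throttle = Throttle(rate_per_sec=1.0)
# Call throttle.wait() immediately before each HTTP request.
```

Lowering `rate_per_sec` below 1.0 corresponds to the advice above for long unattended runs.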

Testing

pytest
pytest --integration -m integration
