CDC WONDER Scraper

Python package for automated web scraping of CDC WONDER databases through the web interface using Selenium WebDriver.

About

This package was developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team to automate data collection from CDC WONDER. CDC WONDER provides access to a variety of public health databases, but collecting data for every U.S. county is difficult:

County-level data is only available through web forms and is not included in bulk downloads
Strict row limits (~10,000 rows per query) require querying data in small chunks
Complete datasets can require hundreds of manual queries across all states, years, and demographics
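
The chunking described above can be sketched as a plain enumeration of (state, year) pairs, each meant to be small enough to stay under the row limit. This is only an illustration: `plan_queries` and the three-state list are hypothetical, and the package's own pipelines may split queries differently.

```python
from itertools import product


def plan_queries(states, first_year, last_year):
    """Return one (state, year) pair per query, covering every combination."""
    years = range(first_year, last_year + 1)
    return list(product(states, years))


# Hypothetical subset of states, used only to keep the example short.
states = ["Alabama", "Alaska", "Arizona"]
plan = plan_queries(states, 2021, 2023)
print(len(plan), "queries, starting with", plan[0])
```

For example, 50 states over the 17 years from 2007 to 2023 would already come to 850 queries for a single Group By combination.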

This package automates the scraping process, enabling the collection of complete county datasets and reducing weeks of manual querying to hours.

Installation

Prerequisites:

Python 3.9+
Chrome browser installed

Install the package:

# Clone the repository
git clone https://github.com/govex/cdc-wonder-scraper.git
cd cdc-wonder-scraper

# Install with uv
uv pip install -e .

How It Works

The core function wonder_query() automates CDC WONDER queries using JSON configuration files that store form element IDs. It:

Loads these element IDs from the config
Uses Selenium to navigate the web form
Fills standard elements (see STANDARD_ELEMENTS in constants.py)
Selects Group By options and filters by year (and state if selected)
Submits the query and downloads results
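
The last step, waiting for a download to finish, is commonly handled by polling the download directory. The sketch below is an assumption about how such a wait could work, not the package's actual implementation. `wait_for_download` is a hypothetical helper, and `.crdownload` is the suffix Chrome gives to files that are still being written.

```python
import time
from pathlib import Path


def wait_for_download(download_dir, known_files, max_wait=60, poll=0.5):
    """Return the first new, fully written file in download_dir, or None on timeout."""
    deadline = time.monotonic() + max_wait
    while True:
        for path in sorted(Path(download_dir).iterdir()):
            # Skip files that existed before the query and Chrome's partial downloads.
            if path.name in known_files or path.suffix == ".crdownload":
                continue
            return path
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Here `known_files` would be the set of file names present in the directory before the query was submitted.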

Customization: This package handles the standard CDC WONDER form operations (terms agreement, Group By selections, year/state filters, result downloads). For database-specific filtering, such as selecting a cause of death, pass a function through the custom_setup_func parameter (see the Multiple Cause of Death example). For databases with non-standard form structures, you may need to adapt the wonder_query() function and the configuration structure.
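
As a rough illustration of such a function, the sketch below assumes the callback receives the Selenium driver as its only argument. The README does not state the signature, so check the Multiple Cause of Death example for the real one. The function name, element ID, and option value are all placeholders.

```python
def select_cause_of_death(driver):
    """Hypothetical custom_setup_func: click one option in a database-specific list."""
    # "id" and "tag name" are the string values of Selenium's By.ID and By.TAG_NAME.
    select_box = driver.find_element("id", "PLACEHOLDER_SELECT_ID")
    for option in select_box.find_elements("tag name", "option"):
        # Click the option whose value matches the placeholder code.
        if option.get_attribute("value") == "PLACEHOLDER_OPTION_VALUE":
            option.click()
            return True
    return False
```

Under that assumption it would be passed as `custom_setup_func=select_cause_of_death` in the call to `wonder_query()`.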

Quick Start

The Natality database example shows a simple query for births by county and gender across different year ranges:

python examples/Births-Natality_db/quickstart_pipeline.py

This queries years from three different database versions (2007-2024, 2003-2006, 1995-2002) and downloads county-level birth counts by gender.

To understand what's happening:

Run with headless=False to watch the browser navigate
Check form_config.json to see the element IDs
See quickstart_pipeline.py for the query logic
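
Because each database version covers its own year range, a pipeline has to pick the right version for each year. A minimal sketch, assuming each entry under `dataset_versions` in form_config.json carries `year_range` and `skip` keys and that `skip` means "ignore this version". `version_for_year` and the version names are hypothetical.

```python
def version_for_year(dataset_versions, year):
    """Return the name of the first non-skipped version whose year_range includes year."""
    for name, cfg in dataset_versions.items():
        first, last = cfg["year_range"]
        if not cfg.get("skip", False) and first <= year <= last:
            return name
    return None


# Hypothetical version names; the year ranges are the three listed above.
versions = {
    "natality_2007_2024": {"year_range": [2007, 2024], "skip": False},
    "natality_2003_2006": {"year_range": [2003, 2006], "skip": False},
    "natality_1995_2002": {"year_range": [1995, 2002], "skip": False},
}
print(version_for_year(versions, 2005))
```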

Create Your Own Scraper

1. Find Element IDs

To scrape a new CDC WONDER dataset, create a form_config.json with the correct element IDs:

Open the CDC WONDER form in Chrome
Right-click on a form element → Inspect
Find the element's identifier in the Elements panel (the id of a select, the value of an option, or the for attribute of a label):
- <select id="SD66.V20"> → use SD66.V20
- <option value="D66.V9-level2">County</option> → use D66.V9-level2
- <label for="SD66.V20"> → use SD66.V20
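
When a form has many controls, the same identifiers can also be collected from saved page source. The sketch below uses only Python's standard `html.parser`. The HTML snippet is a made-up fragment assembled from the three examples above, and `IdCollector` is a hypothetical helper, not part of the package.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect select ids, option values, and label targets from form HTML."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Each tag type keeps its identifier in a different attribute.
        key = {"select": "id", "option": "value", "label": "for"}.get(tag)
        if key and attrs.get(key):
            self.found.append((tag, attrs[key]))


snippet = (
    '<label for="SD66.V20">Group By</label>'
    '<select id="SD66.V20">'
    '<option value="D66.V9-level2">County</option>'
    '</select>'
)
collector = IdCollector()
collector.feed(snippet)
print(collector.found)
```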

2. Create Config File

Example form_config.json. Here year_selector is the ID of the year dropdown, and each groupby_options entry maps a label to the identifier of that Group By option. Standard JSON does not allow comments, so keep annotations out of the file itself:

{
  "dataset_versions": {
    "my_dataset": {
      "base_url": "https://wonder.cdc.gov/my-database.html",
      "year_range": [2007, 2023],
      "skip": false,
      "year_selector": "SD69.V9",
      "groupby_options": {
        "County": "D69.V10-level2",
        "Gender": "D69.V3"
      },
      "queries": {
        "county_gender": ["County", "Gender"]
      }
    }
  }
}
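
A mistyped Group By label in `queries` would otherwise likely surface only once the browser is running, so it can help to check the config first. A minimal sketch, assuming the structure shown above; `check_queries` is a hypothetical helper, not part of the package.

```python
def check_queries(dataset_config):
    """Return (query_name, label) pairs whose label is missing from groupby_options."""
    known = set(dataset_config["groupby_options"])
    problems = []
    for query_name, labels in dataset_config["queries"].items():
        for label in labels:
            if label not in known:
                problems.append((query_name, label))
    return problems


dataset_config = {
    "groupby_options": {"County": "D69.V10-level2", "Gender": "D69.V3"},
    "queries": {
        "county_gender": ["County", "Gender"],
        "county_race": ["County", "Race"],  # "Race" is deliberately undefined
    },
}
print(check_queries(dataset_config))
```

An empty list means every query refers only to labels that have an identifier.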

3. Write Your Querying Script

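from pathlib import Path
from cdc_wonder import setup_browser, wonder_query, load_form_config

# Load the config and pick the dataset version to query
form_config = load_form_config('form_config.json')
dataset_config = form_config['dataset_versions']['my_dataset']

# Create the download directory and start a visible browser
download_path = Path("./downloads")
download_path.mkdir(parents=True, exist_ok=True)
driver = setup_browser(headless=False, download_path=str(download_path))

try:
    # Run one query; returns (success, downloaded_file)
    success, downloaded_file = wonder_query(
        driver=driver,
        dataset_config=dataset_config,
        query_name='county_gender',
        year=2023,
        download_path=download_path
    )
    print(f"success={success}, file={downloaded_file}")
finally:
    # Always close the browser, even if the query raises
    driver.quit()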

Examples

See the examples/ directory for complete implementations:

Births-Natality_db/ - Queries the Natality db to pull birth counts
Deaths-Infant_Deaths_db/ - Queries the Linked Birth Infant Death Records db to pull infant mortality rates
Deaths-Multiple_Cause_of_Death_db/ - Queries the Multiple Cause of Death db to pull firearm mortality rates
Deaths-Underlying_Cause_of_Death_db/ - Queries the Underlying Cause of Death db to pull deaths by age

Each example includes:

form_config.json - Web form ID selectors
pipeline.py - Complete pipeline for executing the queries and downloading files

Core Functions

`wonder_query()`

Main function for executing CDC WONDER queries.

Parameters:

driver - Selenium WebDriver instance
dataset_config - Dataset configuration dict from form_config.json
query_name - Name of query to run (must exist in dataset_config['queries'])
year - Year to process (int)
download_path - Path object for download directory
state_name - Optional state name; None queries all states (see the Underlying Cause of Death example)
custom_setup_func - Optional function for custom UI interactions (see the Multiple Cause of Death example)
max_wait - Maximum seconds to wait for download (default: 60)

Returns:

(success: bool, downloaded_file: Path | None)
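
Because `wonder_query()` reports failure through its return value, a pipeline can retry a failed query after a pause. The sketch below is one possible pattern, not something the package provides. `run_with_retries` is hypothetical and accepts any zero-argument callable that returns a `(success, downloaded_file)` tuple.

```python
import time


def run_with_retries(run_query, attempts=3, pause=5.0):
    """Call run_query until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        success, downloaded_file = run_query()
        if success:
            return True, downloaded_file
        if attempt < attempts:
            time.sleep(pause)  # give the server a moment before trying again
    return False, None
```

In a real script, `run_query` could be a lambda that wraps the `wonder_query(...)` call with the driver, config, query name, year, and download path already filled in.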

Browser Setup

`setup_browser()`

Creates a configured Chrome WebDriver instance.

Parameters:

headless - Run in headless mode (default: True)
download_path - Download directory path (default: "./temp_downloads")

Returns:

Configured WebDriver instance
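
For readers new to Selenium, a Chrome driver with a custom download directory is typically configured along the lines below. This is a generic Selenium 4 sketch, not the body of `setup_browser()`. `chrome_prefs` and `make_driver` are hypothetical names, and Selenium is imported inside `make_driver` so the preference helper can be used without it.

```python
from pathlib import Path


def chrome_prefs(download_path):
    """Chrome preferences that send downloads to download_path without a prompt."""
    return {
        "download.default_directory": str(Path(download_path).resolve()),
        "download.prompt_for_download": False,
    }


def make_driver(headless=True, download_path="./temp_downloads"):
    """Build a Chrome WebDriver (needs Selenium 4 and a local Chrome install)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    options.add_experimental_option("prefs", chrome_prefs(download_path))
    return webdriver.Chrome(options=options)
```

Chrome expects an absolute download directory, which is why the path is resolved before it is placed in the preferences.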

Package Structure

cdc-wonder-scraper/
├── LICENSE
├── README.md
├── pyproject.toml
├── src/
│   └── cdc_wonder/
│       ├── __init__.py
│       ├── scraper.py           # Main wonder_query function
│       ├── browser.py           # Browser setup
│       ├── constants.py         # US_STATES, STANDARD_ELEMENTS
│       └── config.py            # Config utilities
├── tests/
│   └── test_import.py
└── examples/
    ├── Births-Natality_db/
    ├── Deaths-Infant_Deaths_db/
    ├── Deaths-Multiple_Cause_of_Death_db/
    └── Deaths-Underlying_Cause_of_Death_db/

Contributing

This is an early-stage project. Contributions and suggestions are welcome! Feel free to:

Open issues for bugs or feature requests
Submit pull requests with improvements
Share your CDC WONDER configurations and examples

Known Limitations

Requires Chrome browser
May time out on large queries because of CDC WONDER rate limiting
Built and tested on the Deaths and Births databases; it may not work for other CDC WONDER databases (such as the Environmental databases) that have different web form structures

License

MIT License. See LICENSE for details.

Authors

Developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team.
