CDC WONDER Scraper

A Python package that automates queries against CDC WONDER databases through their web interface, using Selenium WebDriver.

About

This package was developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team to automate data collection from CDC WONDER. CDC WONDER provides access to a variety of public health databases, but presents challenges for accessing data for all counties across the U.S.:

  • County-level data is only available through web forms and is not included in bulk downloads
  • Strict row limits (~10,000 rows per query) require querying data in small chunks
  • Complete datasets can require hundreds of manual queries across all states, years, and demographics

This package automates the scraping process, enabling the collection of complete county datasets and reducing weeks of manual querying to hours.

Installation

Prerequisites:

  • Python 3.9+
  • Chrome browser installed

Install the package:

# Clone the repository
git clone https://github.com/govex/cdc-wonder-scraper.git
cd cdc-wonder-scraper

# Install with uv
uv pip install -e .

How It Works

The core function wonder_query() automates CDC WONDER queries using JSON configuration files that store form element IDs. It:

  1. Loads these element IDs from the config
  2. Uses Selenium to navigate the web form
  3. Fills standard elements (see STANDARD_ELEMENTS in constants.py)
  4. Selects Group By options and filters by year (and state if selected)
  5. Submits the query and downloads results

Customization: This package handles standard CDC WONDER form operations (terms agreement, Group By selections, year/state filters, result downloads). For database-specific filtering like death cause selection, use the custom_setup_func parameter (see Multiple Cause of Death example). For databases with non-standard form structures, you may need to adapt the wonder_query() function and configuration structure.
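To illustrate the custom_setup_func idea, here is a minimal sketch of a callback factory that clicks a few options inside one select element. The callback's signature is not spelled out in this section, so the sketch assumes it receives the Selenium driver as its only argument. The helper name make_option_selector and any element IDs passed to it are hypothetical.

```python
def make_option_selector(select_id, option_values):
    """Build a callback that clicks the given <option> values inside one <select>.

    ASSUMPTION: the callback is called with the Selenium driver as its only
    argument. Check the Multiple Cause of Death example for the real contract.
    """
    def setup(driver):
        # "id" and "css selector" are the string values of Selenium's
        # By.ID and By.CSS_SELECTOR locator strategies.
        select = driver.find_element("id", select_id)
        for value in option_values:
            option = select.find_element("css selector", f'option[value="{value}"]')
            option.click()

    return setup


# Hypothetical usage; the IDs below are placeholders, not real form IDs:
# wonder_query(..., custom_setup_func=make_option_selector("some-select-id", ["some-value"]))
```

Building the callback with a factory keeps the database-specific IDs in one place, so the same helper could be reused for different filters.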

Quick Start

The Natality database example shows a simple query for births by county and gender across different year ranges:

python examples/Births-Natality_db/quickstart_pipeline.py

This queries years from three different database versions (2007-2024, 2003-2006, 1995-2002) and downloads county-level birth counts by gender.

To understand what's happening:

  • Run with headless=False to watch the browser navigate
  • Check form_config.json to see the element IDs
  • See quickstart_pipeline.py for the query logic

Create Your Own Scraper

1. Find Element IDs

To scrape a new CDC WONDER dataset, create a form_config.json with the correct element IDs:

  1. Open the CDC WONDER form in Chrome
  2. Right-click on a form element → Inspect
  3. Find the element's identifier in the Elements panel. Depending on the element, this is its id, an option's value, or a label's for attribute:
    • <select id="SD66.V20"> → use SD66.V20
    • <option value="D66.V9-level2">County</option> → use D66.V9-level2
    • <label for="SD66.V20"> → use SD66.V20
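When a form has many fields, inspecting them one by one gets tedious. As an alternative, the sketch below lists every select id and option value found in a saved copy of the form's HTML, using only the standard library. The class name FormIdCollector is an illustrative assumption, and the sample HTML is a made-up fragment (only the SD66.V20 and D66.V9-level2 identifiers come from the examples above).

```python
from html.parser import HTMLParser


class FormIdCollector(HTMLParser):
    """Collect <select> ids and (value, label) pairs for <option> elements."""

    def __init__(self):
        super().__init__()
        self.select_ids = []
        self.options = []            # (value, visible label) pairs
        self._pending_value = None   # value of the <option> currently being read
        self._label_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id"):
            self.select_ids.append(attrs["id"])
        elif tag == "option":
            self._flush_option()     # HTML allows omitting </option>
            self._pending_value = attrs.get("value")
            self._label_parts = []

    def handle_data(self, data):
        if self._pending_value is not None:
            self._label_parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush_option()

    def _flush_option(self):
        if self._pending_value is not None:
            label = "".join(self._label_parts).strip()
            self.options.append((self._pending_value, label))
            self._pending_value = None


# Made-up fragment for demonstration purposes only.
sample_html = """
<select id="SD66.V20">
  <option value="D66.V9-level1">State</option>
  <option value="D66.V9-level2">County</option>
</select>
"""

collector = FormIdCollector()
collector.feed(sample_html)
print(collector.select_ids)
print(collector.options)
```

Options without a value attribute are skipped by this sketch, since there is nothing to copy into the config for them.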

2. Create Config File

Example form_config.json. Here year_selector is the ID of the year dropdown, and each groupby_options value is the identifier of the matching Group By option. Standard JSON does not allow // comments, so these notes are kept outside the file:

{
  "dataset_versions": {
    "my_dataset": {
      "base_url": "https://wonder.cdc.gov/my-database.html",
      "year_range": [2007, 2023],
      "skip": false,
      "year_selector": "SD69.V9",
      "groupby_options": {
        "County": "D69.V10-level2",
        "Gender": "D69.V3"
      },
      "queries": {
        "county_gender": ["County", "Gender"]
      }
    }
  }
}
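A typo in a config file is easier to catch before the browser starts than halfway through a run. The sketch below checks one dataset entry for the keys used in the example config and confirms that every query only references names defined in groupby_options. The function check_dataset_config and the REQUIRED_KEYS tuple are assumptions made for illustration; the package may validate configs differently, or not at all.

```python
# ASSUMPTION: these are the keys a dataset entry needs, based on the example config.
REQUIRED_KEYS = ("base_url", "year_range", "year_selector", "groupby_options", "queries")


def check_dataset_config(name, cfg):
    """Return a list of human-readable problems found in one dataset entry."""
    problems = [f"{name}: missing key '{key}'" for key in REQUIRED_KEYS if key not in cfg]

    years = cfg.get("year_range")
    if years is not None and (len(years) != 2 or years[0] > years[1]):
        problems.append(f"{name}: year_range should be [first_year, last_year]")

    # Every field named in a query must have an entry in groupby_options.
    known = set(cfg.get("groupby_options", {}))
    for query_name, fields in cfg.get("queries", {}).items():
        for field in fields:
            if field not in known:
                problems.append(
                    f"{name}: query '{query_name}' uses unknown Group By '{field}'"
                )
    return problems


example = {
    "base_url": "https://wonder.cdc.gov/my-database.html",
    "year_range": [2007, 2023],
    "skip": False,
    "year_selector": "SD69.V9",
    "groupby_options": {"County": "D69.V10-level2", "Gender": "D69.V3"},
    "queries": {"county_gender": ["County", "Gender"]},
}
print(check_dataset_config("my_dataset", example))  # []

# "Race" is not defined in groupby_options, so it is reported.
broken = dict(example, queries={"county_race": ["County", "Race"]})
print(check_dataset_config("my_dataset", broken))
```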

3. Write Your Querying Script

from pathlib import Path
from cdc_wonder import setup_browser, wonder_query, load_form_config

# Load config
form_config = load_form_config('form_config.json')
dataset_config = form_config['dataset_versions']['my_dataset']

# Set up the download directory and the browser
download_path = Path("./downloads")
download_path.mkdir(parents=True, exist_ok=True)
driver = setup_browser(headless=False, download_path=str(download_path))

try:
    # Run one query for a single year
    success, downloaded_file = wonder_query(
        driver=driver,
        dataset_config=dataset_config,
        query_name='county_gender',
        year=2023,
        download_path=download_path
    )
    if success:
        print(f"Downloaded {downloaded_file}")
    else:
        print("Query failed; no file was downloaded")
finally:
    # Always close the browser, even if the query raises
    driver.quit()
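A single call covers one query and one year, so a full run needs a loop over dataset versions, queries, and years. Here is a minimal sketch of a generator that produces that job list from a loaded config. The name iter_jobs is hypothetical, and two behaviors are assumptions rather than documented facts: that "skip": true means a version should be left out, and that year_range is inclusive on both ends.

```python
def iter_jobs(form_config):
    """Yield (dataset_name, query_name, year) for every enabled dataset version."""
    for name, cfg in form_config["dataset_versions"].items():
        # ASSUMPTION: "skip": true means the version is left out of the run.
        if cfg.get("skip", False):
            continue
        first_year, last_year = cfg["year_range"]
        for query_name in cfg["queries"]:
            # ASSUMPTION: year_range is inclusive on both ends.
            for year in range(first_year, last_year + 1):
                yield name, query_name, year


# Hypothetical two-version config, trimmed to the keys this sketch reads.
sample_config = {
    "dataset_versions": {
        "recent": {
            "year_range": [2021, 2023],
            "skip": False,
            "queries": {"county_gender": ["County", "Gender"]},
        },
        "older": {
            "year_range": [2003, 2006],
            "skip": True,
            "queries": {"county_gender": ["County", "Gender"]},
        },
    }
}

jobs = list(iter_jobs(sample_config))
print(jobs)
```

Each tuple maps onto one wonder_query() call: look up form_config['dataset_versions'][name] for dataset_config, then pass query_name and year.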

Examples

See the examples/ directory for complete implementations:

  • Births-Natality_db/ - Queries the Natality database for birth counts
  • Deaths-Infant_Deaths_db/ - Queries the Linked Birth Infant Death Records database for infant mortality rates
  • Deaths-Multiple_Cause_of_Death_db/ - Queries the Multiple Cause of Death database for firearm mortality rates
  • Deaths-Underlying_Cause_of_Death_db/ - Queries the Underlying Cause of Death database for deaths by age

Each example includes:

  • form_config.json - Web form ID selectors
  • pipeline.py - Complete pipeline for executing the queries and downloading files

Core Functions

wonder_query()

Main function for executing CDC WONDER queries.

Parameters:

  • driver - Selenium WebDriver instance
  • dataset_config - Dataset configuration dict from form_config.json
  • query_name - Name of query to run (must exist in dataset_config['queries'])
  • year - Year to process (int)
  • download_path - Path object for download directory
  • state_name - Optional state name; None queries all states (see the Underlying Cause of Death example)
  • custom_setup_func - Optional function for custom UI interactions (see the Multiple Cause of Death example)
  • max_wait - Maximum seconds to wait for download (default: 60)

Returns:

  • (success: bool, downloaded_file: Path | None)
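To show what a max_wait timeout can mean in practice, here is a minimal sketch of polling a download directory until a new, finished file appears. It is not the package's implementation; wait_for_download and its parameters are assumptions for illustration. The one fact it relies on is that Chrome writes in-progress downloads to a temporary file ending in .crdownload.

```python
import tempfile
import time
from pathlib import Path


def wait_for_download(download_dir, known_files, max_wait=60, poll=0.5):
    """Return the first new, fully written file in download_dir, or None on timeout.

    known_files is the set of file names that existed before the download started.
    """
    deadline = time.monotonic() + max_wait
    while True:
        for path in Path(download_dir).iterdir():
            # Skip Chrome's temporary *.crdownload files; they are still being written.
            if (
                path.is_file()
                and path.name not in known_files
                and path.suffix != ".crdownload"
            ):
                return path
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    before = {p.name for p in folder.iterdir()}   # snapshot before "downloading"
    (folder / "results.txt").write_text("demo")   # stands in for a finished download
    found = wait_for_download(folder, before, max_wait=1)
    print(found.name)  # results.txt
```

Taking the snapshot of existing file names before submitting the query is what lets the function tell a fresh download apart from files left over from earlier runs.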

Browser Setup

setup_browser()

Creates a configured Chrome WebDriver instance.

Parameters:

  • headless - Run in headless mode (default: True)
  • download_path - Download directory path (default: "./temp_downloads")

Returns:

  • Configured WebDriver instance
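For readers curious what such a configuration can involve, the sketch below builds a Chrome driver with a headless switch and a download directory using standard Selenium and Chrome options. It is not necessarily how setup_browser() is implemented; make_driver and chrome_download_prefs are hypothetical names.

```python
from pathlib import Path


def chrome_download_prefs(download_path):
    """Chrome preferences that save downloads to download_path without prompting."""
    return {
        # Resolved to an absolute path so it does not depend on the working directory.
        "download.default_directory": str(Path(download_path).resolve()),
        "download.prompt_for_download": False,
    }


def make_driver(headless=True, download_path="./temp_downloads"):
    """Hypothetical stand-in for setup_browser(); needs selenium and Chrome installed."""
    # Imported here so chrome_download_prefs() stays usable without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    options.add_experimental_option("prefs", chrome_download_prefs(download_path))
    return webdriver.Chrome(options=options)


prefs = chrome_download_prefs("./temp_downloads")
print(prefs["download.prompt_for_download"])  # False
```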

Package Structure

cdc-wonder-scraper/
├── LICENSE
├── README.md
├── pyproject.toml
├── src/
│   └── cdc_wonder/
│       ├── __init__.py
│       ├── scraper.py           # Main wonder_query function
│       ├── browser.py           # Browser setup
│       ├── constants.py         # US_STATES, STANDARD_ELEMENTS
│       └── config.py            # Config utilities
├── tests/
│   └── test_import.py
└── examples/
    ├── Births-Natality_db/
    ├── Deaths-Infant_Deaths_db/
    ├── Deaths-Multiple_Cause_of_Death_db/
    └── Deaths-Underlying_Cause_of_Death_db/

Contributing

This is an early-stage project. Contributions and suggestions are welcome! Feel free to:

  • Open issues for bugs or feature requests
  • Submit pull requests with improvements
  • Share your CDC WONDER configurations and examples

Known Limitations

  • Requires Chrome browser
  • May time out on large queries because of CDC WONDER rate limiting
  • Built and tested on the Deaths and Births databases; it may not work for other CDC WONDER databases (such as the Environmental databases) that use different web form structures
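One way to soften the timeout limitation is to retry a failed query after a growing pause. The sketch below wraps any callable that returns (success, downloaded_file), the same shape wonder_query() returns. The helper run_with_retries, its defaults, and the doubling delay are assumptions for illustration, not part of the package.

```python
import time


def run_with_retries(run_query, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call run_query() until it reports success or the attempts run out.

    run_query must return (success, downloaded_file), like wonder_query().
    """
    for attempt in range(1, attempts + 1):
        success, downloaded_file = run_query()
        if success:
            return True, downloaded_file
        if attempt < attempts:
            # Double the pause after each failure: 5s, 10s, 20s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False, None


# Demonstration with a stub that fails twice, then succeeds.
# The sleep function is replaced so the demo records delays instead of waiting.
outcomes = iter([(False, None), (False, None), (True, "results.txt")])
delays = []
result = run_with_retries(lambda: next(outcomes), sleep=delays.append)
print(result)  # (True, 'results.txt')
print(delays)  # [5.0, 10.0]
```

With the real package, run_query could be a lambda that calls wonder_query() with fixed arguments.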

License

MIT License. See LICENSE for details.

Authors

Developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team.
