Python package for automated web scraping of CDC WONDER databases through the web interface using Selenium WebDriver.
This package was developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team to automate data collection from CDC WONDER. CDC WONDER provides access to a variety of public health databases, but presents challenges for accessing data for all counties across the U.S.:
- County-level data is only available through web forms and is not included in bulk downloads
- Strict row limits (~10,000 rows per query) require querying data in small chunks
- Complete datasets can require hundreds of manual queries across all states, years, and demographics
This package automates the scraping process, enabling the collection of complete county datasets and reducing weeks of manual querying to hours.
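The chunking that the row limit forces can be pictured as a query plan with one small query per state and year. The sketch below is illustrative only: the state list, year range, and build_query_plan helper are hypothetical and not part of the package.

```python
from itertools import product

# Hypothetical inputs chosen for illustration, not taken from the package.
STATES = ["Alabama", "Alaska", "Arizona"]
YEARS = range(2021, 2024)  # 2021, 2022, 2023


def build_query_plan(states, years):
    """Return one (state, year) pair per query so each result stays small."""
    return [(state, year) for state, year in product(states, years)]


plan = build_query_plan(STATES, YEARS)
print(len(plan))  # 3 states x 3 years = 9 queries
print(plan[0])    # ('Alabama', 2021)
```

Adding a demographic dimension multiplies the plan again, which is how a complete county dataset reaches hundreds of queries.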
Prerequisites:
- Python 3.9+
- Chrome browser installed
Install the package:
# Clone the repository
git clone https://github.com/govex/cdc-wonder-scraper.git
cd cdc-wonder-scraper
# Install with uv
uv pip install -e .
The core function wonder_query() automates CDC WONDER queries using JSON configuration files that store form element IDs. It:
- Loads these element IDs from the config
- Uses Selenium to navigate the web form
- Fills standard elements (see standard_elements in constants.py)
- Selects Group By options and filters by year (and state if selected)
- Submits the query and downloads results
Customization: This package handles standard CDC WONDER form operations (terms agreement, Group By selections, year/state filters, result downloads). For database-specific filtering like death cause selection, use the custom_setup_func parameter (see Multiple Cause of Death example). For databases with non-standard form structures, you may need to adapt the wonder_query() function and configuration structure.
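As a rough idea of what a custom_setup_func could look like, here is a minimal sketch. It assumes the callback is called with the Selenium driver as its only argument, which this README does not confirm; check the Multiple Cause of Death example for the real signature. The element ID is a placeholder, not a real CDC WONDER ID.

```python
from functools import partial


def click_element_by_id(driver, element_id):
    """Click one form element, located by its HTML id.

    `driver` is expected to behave like a Selenium WebDriver; the string
    "id" is the locator strategy that Selenium's By.ID constant stands for.
    """
    driver.find_element("id", element_id).click()


# "hypothetical-cause-option" is a placeholder, not a real CDC WONDER ID.
select_cause = partial(click_element_by_id, element_id="hypothetical-cause-option")
```

Under that assumption, select_cause would be passed as custom_setup_func=select_cause so the extra click happens before the query is submitted.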
The Natality database example shows a simple query for births by county and gender across different year ranges:
python examples/Births-Natality_db/quickstart_pipeline.py
This queries years from three different database versions (2007-2024, 2003-2006, 1995-2002) and downloads county-level birth counts by gender.
To understand what's happening:
- Run with headless=False to watch the browser navigate
- Check form_config.json to see the element IDs
- See quickstart_pipeline.py for the query logic
To scrape a new CDC WONDER dataset, create a form_config.json with the correct element IDs:
- Open the CDC WONDER form in Chrome
- Right-click on a form element → Inspect
- Find the element's id attribute (for an <option>, its value; for a <label>, its for target) in the Elements panel:
  - <select id="SD66.V20"> → use SD66.V20
  - <option value="D66.V9-level2">County</option> → use D66.V9-level2
  - <label for="SD66.V20"> → use SD66.V20
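For forms with many fields, listing the candidate IDs from saved page HTML can be quicker than inspecting elements one by one. The sketch below uses only the standard library. FormIdCollector is a hypothetical helper, not part of the package, and the sample markup is modeled on the snippets above rather than copied from a real form.

```python
from html.parser import HTMLParser


class FormIdCollector(HTMLParser):
    """Collect <select> ids and <option> values from a form's HTML."""

    def __init__(self):
        super().__init__()
        self.select_ids = []
        self.option_values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "select" and "id" in attributes:
            self.select_ids.append(attributes["id"])
        elif tag == "option" and "value" in attributes:
            self.option_values.append(attributes["value"])


# Sample markup modeled on the snippets above.
sample_html = """
<select id="SD66.V20">
  <option value="D66.V9-level2">County</option>
</select>
"""

collector = FormIdCollector()
collector.feed(sample_html)
print(collector.select_ids)     # ['SD66.V20']
print(collector.option_values)  # ['D66.V9-level2']
```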
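Example form_config.json (the // comments are annotations for this example; strict JSON does not allow comments, so remove them if your file is read by a standard JSON parser):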
{
"dataset_versions": {
"my_dataset": {
"base_url": "https://wonder.cdc.gov/my-database.html",
"year_range": [2007, 2023],
"skip": false,
"year_selector": "SD69.V9", // ID of year dropdown
"groupby_options": {
"County": "D69.V10-level2", // ID of county option
"Gender": "D69.V3" // ID of gender option
},
"queries": {
"county_gender": ["County", "Gender"]
}
}
}
}
Run a query using this config:
from pathlib import Path
from cdc_wonder import setup_browser, wonder_query, load_form_config
# Load config
form_config = load_form_config('form_config.json')
dataset_config = form_config['dataset_versions']['my_dataset']
# Setup
download_path = Path("./downloads")
download_path.mkdir(exist_ok=True)
driver = setup_browser(headless=False, download_path=str(download_path))
# Run query
success, file = wonder_query(
    driver=driver,
    dataset_config=dataset_config,
    query_name='county_gender',
    year=2023,
    download_path=download_path
)
driver.quit()
See the examples/ directory for complete implementations:
- Births-Natality_db/ - Queries the Natality db to pull birth counts
- Deaths-Infant_Deaths_db/ - Queries the Linked Birth Infant Death Records db to pull infant mortality rates
- Deaths-Multiple_Cause_of_Death_db/ - Queries the Multiple Cause of Death db to pull firearm mortality rates
- Deaths-Underlying_Cause_of_Death_db/ - Queries the Underlying Cause of Death db to pull deaths by age
Each example includes:
- form_config.json - Web form ID selectors
- pipeline.py - Complete pipeline for executing the queries and downloading files
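Before running a pipeline, it can help to sanity-check a config. The sketch below assumes, based on the example config shown earlier, that each label listed under queries must be a key of groupby_options. find_undefined_groupbys is a hypothetical helper, not part of the package.

```python
import json

# Example config text mirroring the structure shown earlier (no comments,
# since the standard json module rejects them).
config_text = """
{
  "dataset_versions": {
    "my_dataset": {
      "base_url": "https://wonder.cdc.gov/my-database.html",
      "year_range": [2007, 2023],
      "skip": false,
      "year_selector": "SD69.V9",
      "groupby_options": {"County": "D69.V10-level2", "Gender": "D69.V3"},
      "queries": {"county_gender": ["County", "Gender"]}
    }
  }
}
"""


def find_undefined_groupbys(dataset_config):
    """Return {query_name: [labels missing from groupby_options]}."""
    known = set(dataset_config["groupby_options"])
    problems = {}
    for query_name, labels in dataset_config["queries"].items():
        missing = [label for label in labels if label not in known]
        if missing:
            problems[query_name] = missing
    return problems


config = json.loads(config_text)
for name, dataset in config["dataset_versions"].items():
    print(name, find_undefined_groupbys(dataset))  # my_dataset {}
```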
wonder_query(): Main function for executing CDC WONDER queries.
Parameters:
- driver - Selenium WebDriver instance
- dataset_config - Dataset configuration dict from form_config.json
- query_name - Name of query to run (must exist in dataset_config['queries'])
- year - Year to process (int)
- download_path - Path object for download directory
- state_name - Optional state name; None means all states (see the Underlying Cause of Death example)
- custom_setup_func - Optional function for custom UI interactions (see the Multiple Cause of Death example)
- max_wait - Maximum seconds to wait for download (default: 60)
Returns:
(success: bool, downloaded_file: Path | None)
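The max_wait parameter bounds how long a query waits for its result file. The package's own waiting logic is not shown in this README, so the following is only a sketch of one common approach: poll the download directory until a new file appears, skipping Chrome's in-progress .crdownload files. wait_for_download and its arguments are hypothetical names.

```python
import time
from pathlib import Path


def wait_for_download(directory, known_files, max_wait=60, poll_interval=0.5):
    """Return the first new, fully written file in `directory`, or None.

    Chrome writes in-progress downloads with a .crdownload suffix, so those
    files are skipped until the browser renames them.
    """
    deadline = time.monotonic() + max_wait
    while True:
        for path in sorted(Path(directory).iterdir()):
            if path.name not in known_files and path.suffix != ".crdownload":
                return path
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

In this sketch, known_files would be a snapshot of the directory's file names taken just before the form is submitted, so files from earlier queries are not mistaken for the new result.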
setup_browser(): Creates a configured Chrome WebDriver instance.
Parameters:
- headless - Run in headless mode (default: True)
- download_path - Download directory path (default: "./temp_downloads")
Returns:
- Configured WebDriver instance
cdc-wonder-scraper/
├── LICENSE
├── README.md
├── pyproject.toml
├── src/
│   └── cdc_wonder/
│       ├── __init__.py
│       ├── scraper.py      # Main wonder_query function
│       ├── browser.py      # Browser setup
│       ├── constants.py    # US_STATES, STANDARD_ELEMENTS
│       └── config.py       # Config utilities
├── tests/
│   └── test_import.py
└── examples/
    ├── Births-Natality_db/
    ├── Deaths-Infant_Deaths_db/
    ├── Deaths-Multiple_Cause_of_Death_db/
    └── Deaths-Underlying_Cause_of_Death_db/
This is an early-stage project. Contributions and suggestions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share your CDC WONDER configurations and examples
Known limitations:
- Requires Chrome browser
- May time out on large queries because of CDC WONDER rate limiting
- Built and tested on the Deaths and Births databases; may not work for other CDC WONDER databases (such as the Environmental databases) that have different web form structures
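One way to soften the timeout limitation is to retry a failed query with increasing pauses. The sketch below assumes a failed or timed-out query is reported as success=False rather than raised as an exception, which this README does not confirm. run_with_retries is a hypothetical helper, not part of the package.

```python
import time


def run_with_retries(run_query, max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call `run_query()` until it reports success or attempts run out.

    `run_query` must return a (success, downloaded_file) tuple, matching the
    return shape documented for wonder_query().
    """
    for attempt in range(1, max_attempts + 1):
        success, downloaded_file = run_query()
        if success:
            return True, downloaded_file
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False, None


# Possible usage (requires a live driver and config, so left as a comment):
# run_with_retries(lambda: wonder_query(
#     driver=driver, dataset_config=dataset_config,
#     query_name='county_gender', year=2023, download_path=download_path))
```

The sleep argument exists only so the pause can be swapped out in tests; real runs would leave it at time.sleep.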
MIT License. See LICENSE for details.
Developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team.