Primary Data Sources

| Source | Data | Years | Variables | Coverage |
|---|---|---|---|---|
| NCI State Cancer Profiles | Age-adjusted cancer incidence & mortality rates | 2016–2020 | ~90 cancer types/metrics | 3,029 counties (all-site) |
| USGS PNSP | Agricultural pesticide application estimates | 2019 + historical (1997–2017) | Total, herbicide, insecticide, fungicide + 12 compounds | 3,063 counties |
| Census ACS | Demographic & socioeconomic characteristics | 2018–2022 (5-year) | Income, poverty, education, age, race/ethnicity | 3,222 counties |
| CDC PLACES | Health behavior & chronic disease prevalence | 2023 release | Smoking, obesity, drinking, diabetes, physical inactivity | 2,957 counties |
| USDA NASS | Livestock inventories & crop acreage | 2022 Census of Agriculture | Hogs, cattle, chickens, total crop acres | 2,890–3,060 counties |
| EPA AQS | PM2.5 air quality (annual mean) | 2019 | Fine particulate matter (μg/m³) | ~800 monitor counties |
| USGS WQP | Nitrate water contamination | 2015–2023 | Mean nitrate (mg/L) | ~1,500 counties |
| CDC EPHT | Temporal cancer incidence panel | 2001–2020 (16 windows) | 9 cancer types, age-adjusted rates | 2,727 counties |
| County Health Rankings | Historical health behavior trends | 2010–2024 | Smoking, obesity, drinking, inactivity over time | ~3,000 counties |
| CDC WONDER | Cancer mortality trends | 1999–2022 | Age-adjusted mortality rates by cause | ~3,100 counties |
| EPA SDWIS | Drinking water violations | 2010–2023 | MCL violations, monitoring violations | All public water systems |

Coverage Summary

Total Counties: 3,248 (cross-sectional dataset, v4)
Variables: 160+ (cancer, pesticide, demographic, health, agricultural)
Cancer Types: 44 (incidence + mortality for ~22 site-specific types)
Temporal Panel: 2,727 (counties with temporal cancer data, 16 windows)

Key Variable Coverage

Coverage varies by data source. Cancer rates for rare types have more suppressed (missing) counties due to NCI small-count suppression rules. The table below lists coverage for 15 key variables, ordered roughly from most to least complete; missing counts are relative to the 3,248-county total.

| Variable | Counties Available | Missing | Missing % |
|---|---|---|---|
| Rural-Urban Continuum Code | 3,233 | 15 | 0.5% |
| Median household income | 3,222 | 26 | 0.8% |
| Total population | 3,222 | 26 | 0.8% |
| Median age | 3,222 | 26 | 0.8% |
| Poverty rate | 3,222 | 26 | 0.8% |
| Total pesticide (kg) | 3,063 | 185 | 5.7% |
| Pesticide density (kg/mi²) | 3,055 | 193 | 5.9% |
| Cattle inventory | 3,060 | 188 | 5.8% |
| Cancer rate (all-site) | 3,029 | 219 | 6.7% |
| Smoking prevalence | 2,957 | 291 | 9.0% |
| Obesity prevalence | 2,957 | 291 | 9.0% |
| Cancer rate (colorectal) | 2,706 | 542 | 16.7% |
| Cancer rate (kidney) | 2,280 | 968 | 29.8% |
| Cancer rate (NHL) | 2,169 | 1,079 | 33.2% |
| Cancer rate (leukemia) | 1,967 | 1,281 | 39.4% |
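
The Missing and Missing % columns follow directly from the 3,248-county total. A minimal sketch of that arithmetic, assuming percentages are rounded to one decimal place (the function name `coverage_row` is illustrative and not taken from the project notebooks):

```python
TOTAL_COUNTIES = 3248  # counties in the cross-sectional dataset


def coverage_row(available, total=TOTAL_COUNTIES):
    """Return (missing count, missing percent rounded to 0.1) for one variable."""
    missing = total - available
    return missing, round(100 * missing / total, 1)


# Two rows from the table above, recomputed from their "available" counts.
for name, available in [
    ("Cancer rate (all-site)", 3029),
    ("Cancer rate (leukemia)", 1967),
]:
    missing, pct = coverage_row(available)
    print(f"{name}: {missing} missing ({pct}%)")
```

Recomputing a few rows this way is a quick check that every variable is measured against the same county denominator.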

Data Suppression

NCI suppresses cancer rates for counties with fewer than 16 cases over the reporting period to protect patient privacy. This means rarer cancer types (leukemia, NHL, bladder) have substantially more missing counties, which reduces statistical power and limits BYM2 analyses to the largest connected component of the county adjacency graph.
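
To illustrate both effects, here is a minimal sketch on toy data: it applies the fewer-than-16-cases rule, then finds the largest connected component among the counties that remain. The county IDs, case counts, and adjacency lists are hypothetical, and the breadth-first search is just one common way to find components, not necessarily what the project's BYM2 code does.

```python
from collections import deque

SUPPRESSION_THRESHOLD = 16  # rates are suppressed below this many cases


def unsuppressed(case_counts, threshold=SUPPRESSION_THRESHOLD):
    """Return the set of counties whose case count meets the threshold."""
    return {county for county, n in case_counts.items() if n >= threshold}


def largest_component(nodes, adjacency):
    """Return the largest connected set of `nodes` under `adjacency` (BFS)."""
    seen, best = set(), set()
    for start in sorted(nodes):
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency.get(current, ()):
                # Only walk through counties that still have data.
                if neighbor in nodes and neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical chain of six counties: A-B-C-D-E-F, with C below the threshold.
case_counts = {"A": 40, "B": 22, "C": 9, "D": 31, "E": 18, "F": 27}
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}

kept = unsuppressed(case_counts)
component = largest_component(kept, adjacency)
print("Counties with data:", sorted(kept))
print("Largest connected component:", sorted(component))
```

In this toy case a single suppressed county splits the chain in two, so an analysis restricted to the largest component would also drop counties A and B even though their rates are available.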

Correlation Structure

[Figure: correlation heatmap of key variables]
Figure 1. Spearman correlation heatmap for key exposure and outcome variables. Cancer rates correlate strongly with smoking and obesity but show weaker, positive correlations with pesticide density. This motivates multivariate and spatial approaches.
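
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranked values, so it captures monotonic rather than strictly linear association. The sketch below uses only the standard library; the smoking and cancer-rate values are hypothetical and are not drawn from the dataset, and the figure itself may have been produced with a different tool.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical county values: smoking prevalence (%) and cancer rate per 100,000.
smoking = [12.0, 15.5, 18.2, 21.0, 24.3]
cancer_rate = [410.0, 430.0, 425.0, 470.0, 490.0]
print(round(spearman(smoking, cancer_rate), 3))
```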

Data Processing Pipeline

Raw data from each source is downloaded via API or web scraping (notebooks 01–01e), then cleaned and merged on 5-digit FIPS codes (notebooks 02–02e). Key processing steps: