|
7 | 7 | "id": "view-in-github" |
8 | 8 | }, |
9 | 9 | "source": [ |
10 | | - "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/v2/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
| 10 | + "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
11 | 11 | ] |
12 | 12 | }, |
13 | 13 | { |
|
57 | 57 | "id": "7SnIECsk7Csw" |
58 | 58 | }, |
59 | 59 | "source": [ |
60 | | - "## 1. Set up environment\n" |
| 60 | + "# 1. Setup Environment\n" |
61 | 61 | ] |
62 | 62 | }, |
63 | 63 | { |
|
66 | 66 | "id": "pysygfoq43NF" |
67 | 67 | }, |
68 | 68 | "source": [ |
69 | | - "### 1.1 Install libraries\n", |
| 69 | + "### 1.1 Install Libraries\n", |
70 | 70 | "\n", |
71 | 71 | "Install the [datacommons-client](https://pypi.org/project/datacommons-client/) library." |
72 | 72 | ] |
|
92 | 92 | } |
93 | 93 | ], |
94 | 94 | "source": [ |
95 | | - "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet" |
| 95 | + "!pip install datacommons-client --upgrade --quiet" |
96 | 96 | ] |
97 | 97 | }, |
98 | 98 | { |
|
101 | 101 | "id": "BtLVyFoN5AiI" |
102 | 102 | }, |
103 | 103 | "source": [ |
104 | | - "### 1.2 Import dependencies\n", |
| 104 | + "### 1.2 Import Dependencies\n", |
105 | 105 | "\n", |
106 | 106 | "Import required libraries for data manipulation, modeling, and plotting.\n" |
107 | 107 | ] |
|
134 | 134 | "id": "ZXzO6qSc5Xk0" |
135 | 135 | }, |
136 | 136 | "source": [ |
137 | | - "### 1.3 Initialize Data Commons client\n", |
| 137 | + "### 1.3 Initialize Data Commons Client\n", |
138 | 138 | "\n", |
139 | 139 | "Initialize the client using your Data Commons API key. Obtain a key from [apikeys.datacommons.org](https://apikeys.datacommons.org/) if you don't have one.\n" |
140 | 140 | ] |
|
158 | 158 | "id": "Ccy9-czCfVTn" |
159 | 159 | }, |
160 | 160 | "source": [ |
161 | | - "## 2. Data acquisition\n", |
| 161 | + "## 2. Data Acquisition\n", |
162 | 162 | "\n", |
163 | 163 | "Fetch statistical observations for the specified variables for all US counties for the year 2021 using the [Python Data Commons API](https://docs.datacommons.org/api/python/v2/)." |
164 | 164 | ] |
|
567 | 567 | "id": "z191ImVmrdds" |
568 | 568 | }, |
569 | 569 | "source": [ |
570 | | - "## 3. Data preparation\n", |
| 570 | + "## 3. Data Preparation\n", |
571 | 571 | "\n", |
572 | 572 | "Process the fetched data for modeling:\n", |
573 | 573 | "\n", |
574 | 574 | "1. **Filter:** Keep only relevant observations based on their `measurementMethod`. For CDC data, this is typically `AgeAdjustedPrevalence`. For Census, `CensusACS5YearSurvey`, and for BLS, `BLSSeasonallyUnadjusted`.\n", |
575 | | - "1. **Select columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
| 575 | + "1. **Select Columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
576 | 576 | "1. **Pivot:** Reshape the dataframe so each variable becomes a column, indexed by county `entity` and `entity_name`.\n", |
577 | | - "1. **Calculate poverty rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
578 | | - "1. **Handle missing values:** Drop rows (counties) with any missing values for the selected variables.\n" |
| 577 | + "1. **Calculate Poverty Rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
| 578 | + "1. **Handle Missing Values:** Drop rows (counties) with any missing values for the selected variables.\n" |
579 | 579 | ] |
580 | 580 | }, |
581 | 581 | { |
|
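The preparation steps listed in this hunk (pivot, poverty-rate calculation, dropping missing rows) can be sketched on a toy long-format dataframe. The county and values below are illustrative stand-ins, not the notebook's fetched data; the variable names mirror the Data Commons DCIDs the notebook uses.

```python
import pandas as pd

# Toy long-format observations, mimicking the fetched structure
df = pd.DataFrame({
    "entity": ["geoId/06001"] * 3,
    "entity_name": ["Alameda County"] * 3,
    "variable": [
        "Percent_Person_Obesity",
        "Count_Person",
        "Count_Person_BelowPovertyLevelInThePast12Months",
    ],
    "value": [22.5, 1_600_000, 160_000],
})

# Pivot: each variable becomes a column, indexed by county
wide = df.pivot_table(index=["entity", "entity_name"],
                      columns="variable", values="value")

# Derive the poverty rate (%) from the two Census counts
wide["poverty_rate"] = (
    wide["Count_Person_BelowPovertyLevelInThePast12Months"]
    / wide["Count_Person"] * 100
)

# Drop counties with any missing values
wide = wide.dropna()
print(wide["poverty_rate"].iloc[0])  # 10.0
```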
949 | 949 | } |
950 | 950 | ], |
951 | 951 | "source": [ |
952 | | - "# Filter the dataframe to only include age-adjusted values\n", |
| 952 | + "# Filter the dataframe to only include age adjusted values\n", |
953 | 953 | "valid_methods = ['AgeAdjustedPrevalence', 'CensusACS5yrSurvey', 'CensusACS5YearSurvey', 'BLSSeasonallyUnadjusted'] # Add all the methods you want to keep\n", |
954 | 954 | "\n", |
955 | 955 | "filtered_df = us_county_observations_df.loc[us_county_observations_df['measurementMethod'].isin(valid_methods), ['entity', 'entity_name', 'variable', 'value']]\n", |
|
975 | 975 | "id": "-ZGRFaJKdHIO" |
976 | 976 | }, |
977 | 977 | "source": [ |
978 | | - "## 4. Exploratory data analysis\n", |
| 978 | + "## 4. Exploratory Data Analysis\n", |
979 | 979 | "\n", |
980 | 980 | "Visualize the relationships between the target variable (Obesity Prevalence) and the predictor variables (High Blood Pressure Prevalence, Unemployment Rate, Poverty Rate) using scatter plots. This helps assess potential correlations.\n" |
981 | 981 | ] |
|
1102 | 1102 | "id": "Bp52dWJNfYSa" |
1103 | 1103 | }, |
1104 | 1104 | "source": [ |
1105 | | - "## 5. Model training\n", |
| 1105 | + "## 5. Model Training\n", |
1106 | 1106 | "\n", |
1107 | 1107 | "Train a linear regression model to predict obesity prevalence based on the selected predictors.\n", |
1108 | 1108 | "\n", |
|
1111 | 1111 | "$$f_\\theta(x) = \\theta_0 + \\theta_1 (\\text{high blood pressure}) + \\theta_2 (\\text{unemployment}) + \\theta_3(\\text{poverty rate})$$\n", |
1112 | 1112 | "<br>\n", |
1113 | 1113 | "\n", |
1114 | | - "### 5.1 Prepare features and target variable\n", |
| 1114 | + "### 5.1 Prepare Features and Target Variable\n", |
1115 | 1115 | "Define the feature matrix `X` (predictors) and the target vector `Y` (obesity prevalence).\n", |
1116 | 1116 | "\n", |
1117 | 1117 | "Let's start by creating our training and test sets. We'll then train a linear regression model using Scikit learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)" |
|
1137 | 1137 | "id": "rmidaLTx_6C9" |
1138 | 1138 | }, |
1139 | 1139 | "source": [ |
1140 | | - "### 5.2 Split data\n", |
| 1140 | + "### 5.2 Split Data\n", |
1141 | 1141 | "\n", |
1142 | 1142 | "Split the data into training and testing sets (80% train, 20% test).\n", |
1143 | 1143 | "\n" |
|
1176 | 1176 | "id": "hu2t8OAGAGFp" |
1177 | 1177 | }, |
1178 | 1178 | "source": [ |
1179 | | - "### 5.3 Train linear regression model\n", |
| 1179 | + "### 5.3 Train Linear Regression Model\n", |
1180 | 1180 | "\n", |
1181 | 1181 | "Instantiate and train the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using the training data.\n", |
1182 | 1182 | "\n" |
|
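Sections 5.1 through 5.3 above (features/target, 80/20 split, model fit) can be sketched end-to-end on synthetic data standing in for the three county-level predictors; the notebook's real `X` and `Y` come from the pivoted dataframe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic predictors: blood pressure, unemployment, poverty rate
X = rng.uniform(0, 50, size=(300, 3))
# Target from a known linear relationship plus noise
Y = 10 + 0.5 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] \
    + rng.normal(0, 1, 300)

# 80% train / 20% test split, as in section 5.2
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

# Fit the linear regression model, as in section 5.3
model = LinearRegression().fit(X_train, Y_train)
print(model.coef_)  # close to the true [0.5, 0.8, 0.3]
```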
1217 | 1217 | "id": "dBmThySxaXKp" |
1218 | 1218 | }, |
1219 | 1219 | "source": [ |
1220 | | - "## 6. Model evaluation\n", |
| 1220 | + "## 6. Model Evaluation\n", |
1221 | 1221 | "\n", |
1222 | 1222 | "Assess the performance of the trained model using the Mean Squared Error (MSE) metric and residual analysis.\n", |
1223 | 1223 | "\n", |
|
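The MSE and residual analysis described in this section reduce to a couple of lines; a minimal sketch on hand-picked toy values (not the notebook's test-set predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy predicted vs. actual obesity prevalence values
y_true = np.array([30.0, 35.0, 40.0])
y_pred = np.array([31.0, 34.0, 42.0])

mse = mean_squared_error(y_true, y_pred)
# Residuals: predicted minus actual; should scatter around zero
residuals = y_pred - y_true

print(mse)        # (1 + 1 + 4) / 3 = 2.0
print(residuals)  # [ 1. -1.  2.]
```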
1271 | 1271 | "id": "VsGLliuzawPE" |
1272 | 1272 | }, |
1273 | 1273 | "source": [ |
1274 | | - "### 6.2 Analyze residuals\n", |
| 1274 | + "### 6.2 Analyze Residuals\n", |
1275 | 1275 | "\n", |
1276 | 1276 | "Calculate and plot the residuals (difference between predicted and actual values) for the test set. Residuals ideally should be randomly scattered around zero." |
1277 | 1277 | ] |
|
1330 | 1330 | "id": "qapl33x8fy_A" |
1331 | 1331 | }, |
1332 | 1332 | "source": [ |
1333 | | - "## 7. Conclusion and next steps\n", |
| 1333 | + "## 7. Conclusion and Next Steps\n", |
1334 | 1334 | "This notebook demonstrated the use of Data Commons to efficiently acquire data from multiple sources (CDC, BLS, Census) and build a simple linear regression model to predict obesity prevalence in US counties. Data Commons significantly streamlines the data gathering and integration process.\n", |
1335 | 1335 | "\n", |
1336 | 1336 | "The resulting model, using high blood pressure prevalence, unemployment rate, and poverty rate, provides a baseline prediction.\n", |
1337 | 1337 | "\n", |
1338 | | - "**Potential improvements & further exploration:**\n", |
| 1338 | + "**Potential Improvements & Further Exploration:**\n", |
1339 | 1339 | "\n", |
1340 | | - "* Add more variables: Incorporate other variables known or hypothesized to correlate with obesity, such as:\n", |
| 1340 | + "* Add More Variables: Incorporate other variables known or hypothesized to correlate with obesity, such as:\n", |
1341 | 1341 | " * `Percent_Person_WithHighCholesterol`\n", |
1342 | 1342 | " * `Percent_Person_WithDiabetes`\n", |
1343 | 1343 | " * Educational attainment levels\n", |
1344 | 1344 | " * Access to healthy food outlets\n", |
1345 | 1345 | " * Physical inactivity rates\n", |
1346 | | - "* **Feature engineering:** Create new features from existing ones.\n", |
1347 | | - "* **Model selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
1348 | | - "* **Geographic analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
1349 | | - "* **Alternative data sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
| 1346 | + "* **Feature Engineering:** Create new features from existing ones.\n", |
| 1347 | + "* **Model Selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
| 1348 | + "* **Geographic Analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
| 1349 | + "* **Alternative Data Sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
1350 | 1350 | "Data Commons provides access to a wide range of variables, enabling exploration of correlations with factors like university counts, crime rates (e.g., arson), or environmental factors (e.g., snowfall), potentially leading to more comprehensive models." |
1351 | 1351 | ] |
1352 | 1352 | } |