Reapply "Update link and minor edits"

kmoscoe · kmoscoe · commit aedb98308cca · 2025-04-21T10:55:05.000-07:00
This reverts commit 42f2672.
diff --git a/notebooks/v2/analyzing_obesity_prevalence.ipynb b/notebooks/v2/analyzing_obesity_prevalence.ipynb
@@ -7,7 +7,7 @@
         "id": "view-in-github"
       },
       "source": [
-        "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+        "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/v2/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
       ]
     },
     {
@@ -57,7 +57,7 @@
         "id": "7SnIECsk7Csw"
       },
       "source": [
-        "# 1. Setup Environment\n"
+        "## 1. Set up environment\n"
       ]
     },
     {
@@ -66,7 +66,7 @@
         "id": "pysygfoq43NF"
       },
       "source": [
-        "### 1.1 Install Libraries\n",
+        "### 1.1 Install libraries\n",
         "\n",
         "Install the [datacommons-client](https://pypi.org/project/datacommons-client/) library."
       ]
@@ -92,7 +92,7 @@
         }
       ],
       "source": [
-        "!pip install datacommons-client --upgrade --quiet"
+        "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet"
       ]
     },
     {
@@ -101,7 +101,7 @@
         "id": "BtLVyFoN5AiI"
       },
       "source": [
-        "### 1.2 Import Dependencies\n",
+        "### 1.2 Import dependencies\n",
         "\n",
         "Import required libraries for data manipulation, modeling, and plotting.\n"
       ]
@@ -134,7 +134,7 @@
         "id": "ZXzO6qSc5Xk0"
       },
       "source": [
-        "### 1.3 Initialize Data Commons Client\n",
+        "### 1.3 Initialize Data Commons client\n",
         "\n",
         "Initialize the client using your Data Commons API key. Obtain a key from [apikeys.datacommons.org](https://apikeys.datacommons.org/) if you don't have one.\n"
       ]
@@ -158,7 +158,7 @@
         "id": "Ccy9-czCfVTn"
       },
       "source": [
-        "## 2. Data Acquisition\n",
+        "## 2. Data acquisition\n",
         "\n",
         "Fetch statistical observations for the specified variables for all US counties for the year 2021 using the [Python Data Commons API](https://docs.datacommons.org/api/python/v2/)."
       ]
@@ -567,15 +567,15 @@
         "id": "z191ImVmrdds"
       },
       "source": [
-        "## 3. Data Preparation\n",
+        "## 3. Data preparation\n",
         "\n",
         "Process the fetched data for modeling:\n",
         "\n",
         "1. **Filter:** Keep only relevant observations based on their `measurementMethod`. For CDC data, this is typically `AgeAdjustedPrevalence`. For Census, `CensusACS5YearSurvey`, and for BLS, `BLSSeasonallyUnadjusted`.\n",
-        "1. **Select Columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n",
+        "1. **Select columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n",
         "1. **Pivot:** Reshape the dataframe so each variable becomes a column, indexed by county `entity` and `entity_name`.\n",
-        "1. **Calculate Poverty Rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n",
-        "1. **Handle Missing Values:** Drop rows (counties) with any missing values for the selected variables.\n"
+        "1. **Calculate poverty rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n",
+        "1. **Handle missing values:** Drop rows (counties) with any missing values for the selected variables.\n"
       ]
     },
     {
@@ -949,7 +949,7 @@
         }
       ],
       "source": [
-        "# Filter the dataframe to only include age adjusted values\n",
+        "# Keep only observations whose measurementMethod is in valid_methods (age-adjusted CDC, Census ACS 5-year, BLS)\n",
         "valid_methods = ['AgeAdjustedPrevalence', 'CensusACS5yrSurvey', 'CensusACS5YearSurvey', 'BLSSeasonallyUnadjusted'] # Add all the methods you want to keep\n",
         "\n",
         "filtered_df = us_county_observations_df.loc[us_county_observations_df['measurementMethod'].isin(valid_methods), ['entity', 'entity_name', 'variable', 'value']]\n",
@@ -975,7 +975,7 @@
         "id": "-ZGRFaJKdHIO"
       },
       "source": [
-        "## 4. Exploratory Data Analysis\n",
+        "## 4. Exploratory data analysis\n",
         "\n",
         "Visualize the relationships between the target variable (Obesity Prevalence) and the predictor variables (High Blood Pressure Prevalence, Unemployment Rate, Poverty Rate) using scatter plots. This helps assess potential correlations.\n"
       ]
@@ -1102,7 +1102,7 @@
         "id": "Bp52dWJNfYSa"
       },
       "source": [
-        "## 5. Model Training\n",
+        "## 5. Model training\n",
         "\n",
         "Train a linear regression model to predict obesity prevalence based on the selected predictors.\n",
         "\n",
@@ -1111,7 +1111,7 @@
         "$$f_\\theta(x) = \\theta_0 + \\theta_1 (\\text{high blood pressure}) + \\theta_2 (\\text{unemployment}) + \\theta_3(\\text{poverty rate})$$\n",
         "<br>\n",
         "\n",
-        "### 5.1 Prepare Features and Target Variable\n",
+        "### 5.1 Prepare features and target variable\n",
         "Define the feature matrix `X` (predictors) and the target vector `Y` (obesity prevalence).\n",
         "\n",
         "Let's start by creating our training and test sets. We'll then train a linear regression model using Scikit learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)"
@@ -1137,7 +1137,7 @@
         "id": "rmidaLTx_6C9"
       },
       "source": [
-        "### 5.2 Split Data\n",
+        "### 5.2 Split data\n",
         "\n",
         "Split the data into training and testing sets (80% train, 20% test).\n",
         "\n"
@@ -1176,7 +1176,7 @@
         "id": "hu2t8OAGAGFp"
       },
       "source": [
-        "### 5.3 Train Linear Regression Model\n",
+        "### 5.3 Train linear regression model\n",
         "\n",
         "Instantiate and train the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using the training data.\n",
         "\n"
@@ -1217,7 +1217,7 @@
         "id": "dBmThySxaXKp"
       },
       "source": [
-        "## 6. Model Evaluation\n",
+        "## 6. Model evaluation\n",
         "\n",
         "Assess the performance of the trained model using the Mean Squared Error (MSE) metric and residual analysis.\n",
         "\n",
@@ -1271,7 +1271,7 @@
         "id": "VsGLliuzawPE"
       },
       "source": [
-        "### 6.2 Analyze Residuals\n",
+        "### 6.2 Analyze residuals\n",
         "\n",
         "Calculate and plot the residuals (difference between predicted and actual values) for the test set. Residuals ideally should be randomly scattered around zero."
       ]
@@ -1330,23 +1330,23 @@
         "id": "qapl33x8fy_A"
       },
       "source": [
-        "## 7. Conclusion and Next Steps\n",
+        "## 7. Conclusion and next steps\n",
         "This notebook demonstrated the use of Data Commons to efficiently acquire data from multiple sources (CDC, BLS, Census) and build a simple linear regression model to predict obesity prevalence in US counties. Data Commons significantly streamlines the data gathering and integration process.\n",
         "\n",
         "The resulting model, using high blood pressure prevalence, unemployment rate, and poverty rate, provides a baseline prediction.\n",
         "\n",
-        "**Potential Improvements & Further Exploration:**\n",
+        "**Potential improvements and further exploration:**\n",
         "\n",
-        "* Add More Variables: Incorporate other variables known or hypothesized to correlate with obesity, such as:\n",
+        "* **Add more variables:** Incorporate other variables known or hypothesized to correlate with obesity, such as:\n",
         "  * `Percent_Person_WithHighCholesterol`\n",
         "  * `Percent_Person_WithDiabetes`\n",
         "  * Educational attainment levels\n",
         "  * Access to healthy food outlets\n",
         "  * Physical inactivity rates\n",
-        "* **Feature Engineering:** Create new features from existing ones.\n",
-        "* **Model Selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n",
-        "* **Geographic Analysis:** Explore spatial patterns in obesity prevalence and model errors.\n",
-        "* **Alternative Data Sources:** Compare model performance using Census unemployment data instead of BLS data.\n",
+        "* **Feature engineering:** Create new features from existing ones.\n",
+        "* **Model selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n",
+        "* **Geographic analysis:** Explore spatial patterns in obesity prevalence and model errors.\n",
+        "* **Alternative data sources:** Compare model performance using Census unemployment data instead of BLS data.\n",
         "Data Commons provides access to a wide range of variables, enabling exploration of correlations with factors like university counts, crime rates (e.g., arson), or environmental factors (e.g., snowfall), potentially leading to more comprehensive models."
       ]
     }