|
33 | 33 | "\n", |
34 | 34 | "In this assignment, we'll be focusing on [linear regression](https://en.wikipedia.org/wiki/Linear_regression), which forms the basis for most regression models. In particular, we'll explore linear regression as a tool for _prediction_. We'll cover _interpreting_ regression models in part 2.\n", |
35 | 35 | "\n", |
36 | | - "### Learning Objectives:\n", |
| 36 | + "### Learning objectives:\n", |
37 | 37 | "* Linear regression for prediction\n", |
38 | | - "* Mean-Squared error\n", |
| 38 | + "* Mean-squared error\n", |
39 | 39 | "* In-sample vs out-of-sample prediction\n", |
40 | | - "* Single variate vs. multivariate Regression\n", |
| 40 | + "* Single variate vs. multivariate regression\n", |
41 | 41 | "* The effect of increasing variables\n", |
42 | 42 | "\n", |
43 | 43 | "---\n", |
|
47 | 47 | "\n", |
48 | 48 | "To build more familiarity with the Data Commons API, check out these [Data Commons tutorials](https://docs.datacommons.org/api/python/v2/tutorials.html).\n", |
49 | 49 | "\n", |
50 | | - "And for help with Pandas and manipulating data frames, take a look at the [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html).\n", |
| 50 | + "And for help with Pandas and manipulating data frames, take a look at the [Pandas documentation](https://pandas.pydata.org/docs/reference/index.html).\n", |
51 | 51 | "\n", |
52 | 52 | "We'll be using the scikit-learn library for implementing our models today. Documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html).\n", |
53 | 53 | "\n", |
|
64 | 64 | "\n", |
65 | 65 | "### Introduction\n", |
66 | 66 | "\n", |
67 | | - "In this assignment, we'll be returning to the scenario we started analyzing in the [model evaluation assignment]() -- analyzing the [obesity epidemic in the United States](https://en.wikipedia.org/wiki/Obesity_in_the_United_States). Obesity rates vary across the nation by geographic location. In this Colab, we'll be exploring how obesity rates vary with different health or societal factors across US cities.\n", |
| 67 | + "In this assignment, we'll be returning to the scenario we started analyzing in the [model evaluation assignment](https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/v2/intro_data_science/Classification_and_Model_Evaluation.ipynb) -- analyzing the [obesity epidemic in the United States](https://en.wikipedia.org/wiki/Obesity_in_the_United_States). Obesity rates vary across the nation by geographic location. In this Colab, we'll be exploring how obesity rates vary with different health or societal factors across US cities.\n", |
68 | 68 | "\n", |
69 | 69 | "In the model evaluation assignment, we limited our analysis to high (>30%) and low (<30%) categories. Today we'll go one step further and predict the obesity rates themselves.\n", |
70 | 70 | "\n", |
71 | | - "Our data science question: **Can we predict the obesity rates of various US Cities based on other health or lifestyle factors?**\n", |
| 71 | + "Our data science question: **Can we predict the obesity rates of various US cities based on other health or lifestyle factors?**\n", |
72 | 72 | "\n", |
73 | 73 | "### Load the libraries and data\n", |
74 | 74 | "\n", |
|
121 | 121 | "id": "OjBcDnD4gnY_" |
122 | 122 | }, |
123 | 123 | "source": [ |
124 | | - "\n", |
125 | | - "\n", |
126 | 124 | "Run the following code box to load the data. We've done some basic data cleaning and manipulation for you, but look through the code to make sure you understand what's going on." |
127 | 125 | ] |
128 | 126 | }, |
|
562 | 560 | "Model A | `Count_Person` | `Percent_Person_Obesity`\n", |
563 | 561 | "Model B | `Percent_Person_PhysicalInactivity` | `Percent_Person_Obesity`\n", |
564 | 562 | "\n", |
565 | | - "**1A)** Just using your intuition, which model do you think will be better at predicting obesity rates? Why?\n", |
566 | | - "\n" |
| 563 | + "**1A)** Just using your intuition, which model do you think will be better at predicting obesity rates? Why?" |
567 | 564 | ] |
568 | 565 | }, |
569 | 566 | { |
|
1477 | 1474 | "\n", |
1478 | 1475 | "**1J)** For the model you selected in the question above, how much would you trust this model? What are its limitations?\n", |
1479 | 1476 | "\n", |
1480 | | - "**1K)**Can you think of any ways to create an even better model?\n" |
| 1477 | + "**1K)** Can you think of any ways to create an even better model?\n" |
1481 | 1478 | ] |
1482 | 1479 | }, |
1483 | 1480 | { |
|
2356 | 2353 | "id": "mtPbEzqvAWSe" |
2357 | 2354 | }, |
2358 | 2355 | "source": [ |
2359 | | - "**2B)** How does the out-of-sample RMSE compare with that of the single variable models A and B?\n", |
| 2356 | + "**2C)** How does the out-of-sample RMSE compare with that of the single-variable models A and B?\n", |
2360 | 2357 | "\n", |
2361 | | - "**2C)** In general, how would you expect adding more variables to affect the resulting prediction error: increase, decrease, or no substantial change?" |
| 2358 | + "**2D)** In general, how would you expect adding more variables to affect the resulting prediction error: increase, decrease, or no substantial change?" |
2362 | 2359 | ] |
2363 | 2360 | }, |
2364 | 2361 | { |
|
2952 | 2949 | "id": "rhfPjm71Bhfa" |
2953 | 2950 | }, |
2954 | 2951 | "source": [ |
2955 | | - "**2D)** Take a look at the list of variables we'll be using this time. Do you think all of them will be useful/predictive?\n", |
| 2952 | + "**2E)** Take a look at the list of variables we'll be using this time. Do you think all of them will be useful/predictive?\n", |
2956 | 2953 | "\n", |
2957 | | - "**2E)** Based on your intuition, do you think adding all these models will help or hinder predictive accuracy?\n", |
| 2954 | + "**2F)** Based on your intuition, do you think adding all these variables will help or hinder predictive accuracy?\n", |
2958 | 2955 | "\n", |
2959 | 2956 | "Let's now build a model and see what happens." |
2960 | 2957 | ] |
|
3955 | 3952 | "id": "qinowt_kB4oM" |
3956 | 3953 | }, |
3957 | 3954 | "source": [ |
3958 | | - "**2F)** How does the in-sample and out-of-sample RMSE compare with the smaller model from question 2A?\n", |
| 3955 | + "**2G)** How do the in-sample and out-of-sample RMSE compare with the smaller model from question 2A?\n", |
3959 | 3956 | "\n", |
3960 | | - "**2G)** Analyze the coefficients of the new larger regression model. Which variables seem to affect the prediction most?\n", |
| 3957 | + "**2H)** Analyze the coefficients of the new larger regression model. Which variables seem to affect the prediction most?\n", |
3961 | 3958 | "\n", |
3962 | | - "**2H)** Is it easy to tell? In other words, how interpretable is this model?" |
| 3959 | + "**2I)** Is it easy to tell? In other words, how interpretable is this model?" |
3963 | 3960 | ] |
3964 | 3961 | } |
3965 | 3962 | ], |
|