<p>No! Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer, as shown in <a href="#fig-sun-causes-cancer" class="quarto-xref">Figure <span>2.7</span></a>. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, they are more likely to use sunscreen <em>and</em> more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple observational investigation.</p>
837
837
<div class="cell" data-layout-align="center">
838
838
<div class="cell-output-display">
839
-
<div id="fig-sun-causes-cancer" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Three boxes are shown in a triangle arrangement representing: sun exposure, using sunscreen, and skin cancer. A solid arrow connects sun exposure as a causal mechanism to using sunscreen; a solid arrow also connects sun exposure as a causal mechanism to skin cancer. A questioning arrow indicates that the causal effect of using sunscreen on skin cancer is unknown. ">
839
+
<div id="fig-sun-causes-cancer" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Three boxes are shown in a triangle arrangement representing: sun exposure, using sunscreen, and skin cancer. A solid arrow connects sun exposure as a causal mechanism to using sunscreen; a solid arrow also connects sun exposure as a causal mechanism to skin cancer. A questioning arrow indicates that the causal effect of using sunscreen on skin cancer is unknown. " data-fig-align="center">
<a href="data-design_files/figure-html/fig-sun-causes-cancer-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7" title="Figure 2.7: Sun exposure may be the root cause of both sunscreen use and skin cancer."><img src="data-design_files/figure-html/fig-sun-causes-cancer-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:60.0%" alt="Three boxes are shown in a triangle arrangement representing: sun exposure, using sunscreen, and skin cancer. A solid arrow connects sun exposure as a causal mechanism to using sunscreen; a solid arrow also connects sun exposure as a causal mechanism to skin cancer. A questioning arrow indicates that the causal effect of using sunscreen on skin cancer is unknown. "></a>
<p>A proficient analyst will have a good sense of the types of data they are working with and how to visualize the data in order to gain a complete understanding of the variables. Equally important, however, is the data source. In this chapter, we have discussed randomized experiments and taking good, random, representative samples from a population. When we discuss inferential methods (starting in <a href="foundations-randomization.html" class="quarto-xref"><span>Chapter 11</span></a>), the conclusions that can be drawn will be dependent on how the data were collected. <a href="#fig-randsampValloc" class="quarto-xref">Figure <span>2.8</span></a> summarizes how sampling and assignment methods relate to the scope of inference.<a href="#fn10" class="footnote-ref" id="fnref10" role="doc-noteref"><sup>10</sup></a> Regularly revisiting <a href="#fig-randsampValloc" class="quarto-xref">Figure <span>2.8</span></a> will be important when making conclusions from a given data analysis.</p>
859
859
<div class="cell">
860
860
<div class="cell-output-display">
861
-
<div id="fig-randsampValloc" class="quarto-float quarto-figure quarto-figure-center anchored" alt="A two by two table describing the scenarios of random sample or not and random allocation or not. Selecting randomly from a population allows for generalization back to the population. Randomly allocating in an experiment allows for establishing causation. " data-fig-pos="H">
861
+
<div id="fig-randsampValloc" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-pos="H" alt="A two by two table describing the scenarios of random sample or not and random allocation or not. Selecting randomly from a population allows for generalization back to the population. Randomly allocating in an experiment allows for establishing causation. ">
<a href="images/randsampValloc.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8" title="Figure 2.8: Analysis conclusions should be made carefully according to how the data were collected. Very few datasets come from the top left box because usually ethics require that random assignment of treatments can only be given to volunteers. Both representative (ideally random) sampling and experiments (random assignment of treatments) are important for how statistical conclusions can be made on populations."><img src="images/randsampValloc.png" class="img-fluid figure-img" style="width:96.0%" data-fig-pos="H" alt="A two by two table describing the scenarios of random sample or not and random allocation or not. Selecting randomly from a population allows for generalization back to the population. Randomly allocating in an experiment allows for establishing causation. "></a>
<p>It might be a little easier to review the results using a visualization. <a href="#fig-opportunity-cost-obs-bar" class="quarto-xref">Figure <span>11.5</span></a> shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.</p>
983
983
<div class="cell">
984
984
<div class="cell-output-display">
985
-
<div id="fig-opportunity-cost-obs-bar" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Stacked bar plot with groups of control and treatment and filled using the proportion who did and did not buy the video. 74% of the control group bought the video as compared with a little over 50% of the treatment group who bought the video. " data-fig-pos="H">
985
+
<div id="fig-opportunity-cost-obs-bar" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-pos="H" alt="Stacked bar plot with groups of control and treatment and filled using the proportion who did and did not buy the video. 74% of the control group bought the video as compared with a little over 50% of the treatment group who bought the video. ">
<a href="foundations-randomization_files/figure-html/fig-opportunity-cost-obs-bar-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="Figure 11.5: Stacked bar plot of results of the opportunity cost study."><img src="foundations-randomization_files/figure-html/fig-opportunity-cost-obs-bar-1.png" class="img-fluid figure-img" style="width:90.0%" data-fig-pos="H" alt="Stacked bar plot with groups of control and treatment and filled using the proportion who did and did not buy the video. 74% of the control group bought the video as compared with a little over 50% of the treatment group who bought the video. "></a>
<span class="header-section-number">26.1</span> Model diagnostics</h2>
715
-
<p>Before looking at the hypothesis tests associated with the coefficients (turns out they are very similar to those in linear regression!), it is valuable to understand the technical conditions that underlie the inference applied to the logistic regression model. Generally, as you’ve seen in the logistic regression modeling examples, it is imperative that the response variable is binary. Additionally, the key technical condition for logistic regression has to do with the relationship between the predictor variables <span class="math inline">\((x_i\)</span> values) and the probability the outcome will be a success. It turns out, the relationship is a specific functional form called a logit function, where <span class="math inline">\({\rm logit}(p) = \log_e(\frac{p}{1-p}).\)</span> The function may feel complicated, and memorizing the formula of the logit is not necessary for understanding logistic regression. What you do need to remember is that the probability of the outcome being a success is a function of a linear combination of the explanatory variables.</p>
715
+
<p>Before looking at the hypothesis tests associated with the coefficients (turns out they are very similar to those in linear regression!), it is valuable to understand the technical conditions that underlie the inference applied to the logistic regression model. Generally, as you’ve seen in the logistic regression modeling examples, it is imperative that the response variable is binary. Additionally, the key technical condition for logistic regression has to do with the relationship between the predictor variables (<span class="math inline">\(x_i\)</span> values) and the probability the outcome will be a success. It turns out, the relationship is a specific functional form called a logit function, where <span class="math inline">\({\rm logit}(p) = \log_e(\frac{p}{1-p}).\)</span> The function may feel complicated, and memorizing the formula of the logit is not necessary for understanding logistic regression. What you do need to remember is that the probability of the outcome being a success is a function of a linear combination of the explanatory variables.</p>
<span class="header-section-number">25.1</span> Multiple regression output from software</h2>
640
640
<p>Recall the <code>loans</code> data from <a href="model-mlr.html" class="quarto-xref">Chapter <span>8</span></a>.</p>
641
641
<div class="data">
642
-
<p>The <a href="http://openintrostat.github.io/openintro/reference/loans_full_schema.html"><code>loans_full_schema</code></a> data can be found in the <a href="http://openintrostat.github.io/openintro"><strong>openintro</strong></a> R package. Based on the data in this dataset we have created two new variables: <code>credit_util</code> which is calculated as the total credit utilized divided by the total credit limit and <code>bankruptcy</code> which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcies). We will refer to this modified dataset as <code>loans</code>.</p>
642
+
<p>The <a href="http://openintrostat.github.io/openintro/reference/loans_full_schema.html"><code>loans_full_schema</code></a> data can be found in the <a href="http://openintrostat.github.io/openintro"><strong>openintro</strong></a> R package. Based on the data in this dataset we have created two new variables: <code>credit_util</code> which is calculated as the total credit utilized divided by the total credit limit and <code>bankruptcy</code> which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcy). We will refer to this modified dataset as <code>loans</code>.</p>
643
643
</div>
644
644
<p>Now, our goal is to create a model where <code>interest_rate</code> can be predicted using the variables <code>debt_to_income</code>, <code>term</code>, and <code>credit_checks</code>. As you learned in <a href="model-mlr.html" class="quarto-xref"><span>Chapter 8</span></a>, least squares can be used to find the coefficient estimates for the linear model. The unknown population model can be written as:</p>
Figure 25.2: Two plots describing the total amount of money (USD) as a function of the total number of coins or low coins. As you might expect, the total amount of money is more highly postively correlated with the total number of coins than with the number of low coins.
770
+
Figure 25.2: Two plots describing the total amount of money (USD) as a function of the total number of coins or low coins. As you might expect, the total amount of money is more highly positively correlated with the total number of coins than with the number of low coins.
771
771
</figcaption></figure>
772
772
</div>
773
773
<p>Using the total <code>number_of_coins</code> as the predictor variable, <a href="#tbl-coinhigh" class="quarto-xref">Table <span>25.2</span></a> provides the least squares estimate of the coefficient is 0.13. For every additional coin in the dish, we would predict that the student had US$0.13 more. The <span class="math inline">\(b_1 = 0.13\)</span> coefficient has a small p-value associated with it, suggesting we would not have seen data like this if <code>number_of_coins</code> and <code>total_amount</code> of money were not linearly related.</p>
<p>The <a href="https://allisonhorst.github.io/palmerpenguins/articles/intro.html"><code>penguins</code></a> data can be found in the <a href="https://github.com/allisonhorst/palmerpenguins"><strong>palmerpenguings</strong></a> R package.</p>
962
+
<p>The <a href="https://allisonhorst.github.io/palmerpenguins/articles/intro.html"><code>penguins</code></a> data can be found in the <a href="https://github.com/allisonhorst/palmerpenguins"><strong>palmerpenguins</strong></a> R package.</p>
963
963
</div>
964
964
<p>Our goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams. The observations of three different penguin species include measurements on body size and sex. The data were collected by <a href="https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php">Dr. Kristen Gorman</a> and the <a href="https://pal.lternet.edu/">Palmer Station, Antarctica LTER</a> as part of the <a href="https://lternet.edu/">Long Term Ecological Research Network</a>. <span class="citation" data-cites="Gorman:2014">(<a href="references.html#ref-Gorman:2014" role="doc-biblioref">Gorman, Williams, and Fraser 2014</a>)</span> Although not exactly aligned with this research project, you might be able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial photographs) but the mass is not known. The first model predicts <code>body_mass_g</code> by using only the <code>bill_length_mm</code>, a variable denoting the length of a penguin’s bill, in mm. The second model predicts <code>body_mass_g</code> by using <code>bill_length_mm</code>, <code>bill_depth_mm</code>, <code>flipper_length_mm</code>, <code>sex</code>, and <code>species</code>.</p>
<span class="header-section-number">25.3.1</span> Comparing two models to predict body mass in penguins</h3>
973
973
<p>The question we will seek to answer is whether the predictions of <code>body_mass_g</code> are substantially better when <code>bill_length_mm</code>, <code>bill_depth_mm</code>, <code>flipper_length_mm</code>, <code>sex</code>, and <code>species</code> are used in the model, as compared with a model on <code>bill_length_mm</code> only.</p>
974
-
<p>We refer to the model given with only <code>bill_lengh_mm</code> as the <strong>smaller</strong> model. It is seen in <a href="#tbl-peng-lm-bill" class="quarto-xref">Table <span>25.5</span></a> with coefficient estimates of the parameters as well as standard errors and p-values. We refer to the model given with <code>bill_lengh_mm</code>, <code>bill_depth_mm</code>, <code>flipper_length_mm</code>, <code>sex</code>, and <code>species</code> as the <strong>larger</strong> model. It is seen in <a href="#tbl-peng-lm-all" class="quarto-xref">Table <span>25.6</span></a> with coefficient estimates of the parameters as well as standard errors and p-values. Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically discernible predictor of <code>body_mass_g</code>, given all other variables in the model. However, in this section, we will go beyond the use of p-values to consider independent predictions of <code>body_mass_g</code> as a way to compare the smaller and larger models.</p>
974
+
<p>We refer to the model given with only <code>bill_length_mm</code> as the <strong>smaller</strong> model. It is seen in <a href="#tbl-peng-lm-bill" class="quarto-xref">Table <span>25.5</span></a> with coefficient estimates of the parameters as well as standard errors and p-values. We refer to the model given with <code>bill_length_mm</code>, <code>bill_depth_mm</code>, <code>flipper_length_mm</code>, <code>sex</code>, and <code>species</code> as the <strong>larger</strong> model. It is seen in <a href="#tbl-peng-lm-all" class="quarto-xref">Table <span>25.6</span></a> with coefficient estimates of the parameters as well as standard errors and p-values. Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically discernible predictor of <code>body_mass_g</code>, given all other variables in the model. However, in this section, we will go beyond the use of p-values to consider independent predictions of <code>body_mass_g</code> as a way to compare the smaller and larger models.</p>