More typos

mine-cetinkaya-rundel · mine-cetinkaya-rundel · commit fdeee166e809 · 2023-02-22T02:17:36.000-05:00
Merge branch 'main' of https://github.com/OpenIntroStat/ims

# Conflicts:
#	exercises/07-ex-model-slr.Rmd
diff --git a/.github/workflows/build_book.yaml b/.github/workflows/build_book.yaml
@@ -24,14 +24,14 @@ jobs:
         uses: actions/checkout@v2
 
       - name: Setup R
-        uses: r-lib/actions/setup-r@master
+        uses: r-lib/actions/setup-r@v2-branch
         
       - name: Install imagemagick
         run: |
           brew install imagemagick@6
 
       - name: Setup pandoc
-        uses: r-lib/actions/setup-pandoc@master
+        uses: r-lib/actions/setup-pandoc@v2-branch
         with:
           pandoc-version: '2.11.2'
           
diff --git a/02-data-design.Rmd b/02-data-design.Rmd
@@ -57,7 +57,7 @@ For the second and third questions above, identify the target population and wha
 In most statistical analysis procedures, the research question at hand boils down to understanding a numerical summary.
 The number (or set of numbers) may be a quantity you are already familiar with (like the average) or it may be something you learn through this text (like the slope and intercept from a least squares model, provided in Section \@ref(least-squares-regression)).
 
-A numerical summary can be calculated on either the sample of observation or the entire population.
+A numerical summary can be calculated on either the sample of observations or the entire population.
 However, measuring every unit in the population is usually prohibitive (so the parameter is very rarely calculated).
 So, a "typical" numerical summary is calculated from a sample.
 Yet, we can still conceptualize calculating the average income of all adults in Argentina.
diff --git a/03-data-applications.Rmd b/03-data-applications.Rmd
@@ -108,7 +108,7 @@ The strength variable is trickier to classify -- we can think of it as discrete
 One way of approaching this is thinking about whether the values the variable takes vary linearly, e.g., is the difference in strength between passwords with strength levels 8 and 9 the same as the difference between those with strength levels 9 and 10?
 If this is not necessarily the case, we would classify the variable as ordinal.
 Determining the classification of this variable requires understanding of how `strength` values were determined, which is a very typical workflow for working with data.
-Sometimes the data dictionary (presented in Table \@ref(tab:passwords-var-def) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
+Sometimes the data dictionary (presented in Table \@ref(tab:passwords-var-def)) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
 :::
 
 Next, let's try to get to know each variable a little bit better.
diff --git a/11-foundations-randomization.Rmd b/11-foundations-randomization.Rmd
@@ -663,7 +663,7 @@ In each of these examples, the **point estimate** of the difference in proportio
 
 When the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically significant**\index{statistically significant}.
 This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.
-The threshold, called the **significance level**\index{hypothesis testing!significance level}\index{significance level} and often represented by $\alpha$ (the Greek letter *alpha*).
+The threshold is called the **significance level**\index{hypothesis testing!significance level}\index{significance level} and often represented by $\alpha$ (the Greek letter *alpha*).
 The value of $\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected.
 Historically, many fields have set $\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time, if the null hypothesis is to be rejected.
 The value of $\alpha$ can vary depending on the field or the application.
diff --git a/13-foundations-mathematical.Rmd b/13-foundations-mathematical.Rmd
@@ -1080,7 +1080,7 @@ stent30 %>%
   pivot_wider(names_from = outcome, values_from = n, values_fill = 0) %>%
   janitor::adorn_totals(where = c("row", "col")) %>%
   kbl(linesep = "", booktabs = TRUE, caption = "Descriptive statistics for 30-day results for the stent study.",
-      col.names = c("Group", "No event", "Stroke", "Total")) %>%
+      col.names = c("Group", "Stroke", "No event", "Total")) %>%
   kable_styling(bootstrap_options = c("striped", "condensed"), 
                 latex_options = c("striped", "hold_position"), full_width = FALSE) %>%
   column_spec(1:4, width = "7em")
diff --git a/15-foundations-applications.Rmd b/15-foundations-applications.Rmd
@@ -71,7 +71,7 @@ Reading through the different definitions and solidifying your understanding wil
 
 In this case study, we consider a new malaria vaccine called PfSPZ.
 In the malaria study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine.
-Nineteen weeks later, all 20 patients were exposed to a drug-sensitive malaria virus strain; the motivation of using a drug-sensitive strain of virus here is for ethical considerations, allowing any infections to be treated effectively.
+Nineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of the malaria parasite; the motivation of using a drug-sensitive strain here is for ethical considerations, allowing any infections to be treated effectively.
 
 ::: {.data data-latex=""}
 The [`malaria`](http://openintrostat.github.io/openintro/reference/malaria.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
diff --git a/17-inference-two-props.Rmd b/17-inference-two-props.Rmd
@@ -653,7 +653,7 @@ Thus, we do not observe benefits or harm from mammograms relative to a regular b
 Can we conclude that mammograms have no benefits or harm?
 Here are a few considerations to keep in mind when reviewing the mammogram study as well as any other medical study:
 
--   We do not accept the null hypothesis, which means we do not have sufficient evidence to conclude that mammograms reduce or increase breast cancer deaths.
+-   We do not reject the null hypothesis, which means we do not have sufficient evidence to conclude that mammograms reduce or increase breast cancer deaths.
 -   If mammograms are helpful or harmful, the data suggest the effect isn't very large.
 -   Are mammograms more or less expensive than a non-mammogram breast exam? If one option is much more expensive than the other and does not offer clear benefits, then we should lean towards the less expensive option.
 -   The study's authors also found that mammograms led to over-diagnosis of breast cancer, which means some breast cancers were found (or thought to be found) but that these cancers would not cause symptoms during patients' lifetimes. That is, something else would kill the patient before breast cancer symptoms appeared. This means some patients may have been treated for breast cancer unnecessarily, and this treatment is another cost to consider. It is also important to recognize that over-diagnosis can cause unnecessary physical or emotional harm to patients.
diff --git a/20-inference-two-means.Rmd b/20-inference-two-means.Rmd
@@ -154,7 +154,7 @@ Approximate the p-value depicted in Figure \@ref(fig:randexamspval), and provide
 
 ------------------------------------------------------------------------
 
-Using software, we can find the number of shuffled differences in means that are less than the observed difference (of 3.14) is 19 (out of 1,000 randomizations).
+Using software, we can find the number of shuffled differences in means that are less than the observed difference (of 3.14) is 900 (out of 1,000 randomizations).
 So 10% of the simulations are larger than the observed difference.
 To get the p-value, we double the proportion of randomized differences which are larger than the observed difference, p-value = 0.2.
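
The arithmetic in this hunk can be checked with a minimal R sketch. The helper name `two_sided_p_value` and the vector `null_diffs` are assumptions made for illustration: the vector is constructed only so that 900 of 1,000 values fall below 3.14, and it is not the book's simulation output.

```r
# Hypothetical helper: doubles the proportion of shuffled differences
# that are larger than the observed difference, as the text describes.
two_sided_p_value <- function(null_diffs, obs_diff) {
  prop_larger <- mean(null_diffs > obs_diff)
  2 * prop_larger
}

# Toy stand-in for the 1,000 shuffled differences (not real output):
# 900 values below the observed difference and 100 above it.
null_diffs <- c(rep(0, 900), rep(5, 100))
obs_diff <- 3.14

p_value <- two_sided_p_value(null_diffs, obs_diff)
```

With these counts, the proportion of larger differences is 100/1,000 = 0.1, and doubling it gives 0.2, which agrees with the corrected count of 900.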
 
diff --git a/24-inf-model-slr.Rmd b/24-inf-model-slr.Rmd
@@ -531,7 +531,7 @@ In context, we are 95% confident that for the model describing the population of
 ## Mathematical model for testing the slope {#mathslope}
 
 When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.
-The approximations will build on the t-distribution which were described in Chapter \@ref(inference-one-mean).
+The approximations will build on the t-distribution which was described in Chapter \@ref(inference-one-mean).
 The mathematical model is often correct and is usually easy to implement computationally.
 The validity of the technical conditions will be considered in detail in Section \@ref(tech-cond-linmod).
 
@@ -799,7 +799,6 @@ However, there are other types of intervals that may be of interest, including p
 
 In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.
 In this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.
-.
 
 ```{r include=FALSE}
 terms_chp_24 <- c(terms_chp_24, "technical conditions linear regression")
diff --git a/25-inf-model-mlr.Rmd b/25-inf-model-mlr.Rmd
@@ -262,7 +262,7 @@ lm(total_amount ~ number_of_coins + number_of_low_coins, data = money) %>%
   column_spec(2:5, width = "5em")
 ```
 
-When working with multiple regression models, interpreting the model coefficient is mot always as straightforward as it was with the coin example.
+When working with multiple regression models, interpreting the model coefficient is not always as straightforward as it was with the coin example.
 However, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.
 
 ::: {.important data-latex=""}
diff --git a/26-inf-model-logistic.Rmd b/26-inf-model-logistic.Rmd
@@ -55,7 +55,7 @@ email_variables %>%
 ## Model diagnostics
 
 Before looking at the hypothesis tests associated with the coefficients (turns out they are very similar to those in linear regression!), it is valuable to understand the technical conditions that underlie the inference applied to the logistic regression model.
-Generally, as you've seen in the logistic regression modeling examples, it is imperative that the response variable in binary.
+Generally, as you've seen in the logistic regression modeling examples, it is imperative that the response variable is binary.
 Additionally, the key technical condition for logistic regression has to do with the relationship between the predictor variables ($x_i$ values) and the probability the outcome will be a success.
 It turns out, the relationship is a specific functional form called a logit function, where ${\rm logit}(p) = \log_e(\frac{p}{1-p}).$ The function may feel complicated, and memorizing the formula of the logit is not necessary for understanding logistic regression.
 What you do need to remember is that the probability of the outcome being a success is a function of a linear combination of the explanatory variables.
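
For readers of this hunk, the logit relationship can be sketched in a few lines of R. The function names `logit` and `inv_logit` and the coefficients `b0` and `b1` are hypothetical, chosen only for illustration; they are not taken from the book's email model.

```r
# logit(p) = log(p / (1 - p)), using the natural log
logit <- function(p) log(p / (1 - p))

# Inverse of the logit: maps a linear predictor back to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients and predictor values, for illustration only
b0 <- -2
b1 <- 0.5
x <- c(0, 2, 4)

eta <- b0 + b1 * x           # linear combination of the explanatory variable
p_success <- inv_logit(eta)  # probabilities strictly between 0 and 1
```

Base R's `qlogis()` and `plogis()` compute the same logit and inverse logit, so the hand-written helpers are only there to make the formula visible.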
@@ -275,7 +275,7 @@ The p-value is a probability measure under a setting of no relationship.
 That p-value provides information about the degree of the relationship (e.g., above we measure the relationship between `spam` and `to_multiple` using a p-value), but the p-value does not measure how well the model will predict the individual emails (e.g., the accuracy of the model).
 Depending on the goal of the research project, you might be inclined to focus on variable importance (through p-values) or you might be inclined to focus on prediction accuracy (through cross-validation).
 
-Here we present a method for using cross-validation accuracy to determine which variables (if any) should be used in a model which predicts whether an email is .
+Here we present a method for using cross-validation accuracy to determine which variables (if any) should be used in a model which predicts whether an email is spam.
 A full treatment of cross-validation and logistic regression models is beyond the scope of this text.
 Using cross-validation, we can build $k$ different models which are used to predict the observations in each of the $k$ holdout samples.
 The smaller model uses only the `to_multiple` variable, see the complete dataset (not cross-validated) model output in Table \@ref(tab:emaillogmodel1).
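
The idea of comparing models by cross-validation accuracy can be sketched roughly as follows. This is not the book's code: the helper `cv_accuracy`, the 0.5 classification cutoff, and the simulated data frame `toy` (with columns `x1`, `x2`, `y`) are all assumptions made so that the example is self-contained.

```r
# Hypothetical helper: k-fold cross-validated accuracy for a logistic model.
# Assumes the response is coded 0/1 and uses a 0.5 cutoff (an assumption).
cv_accuracy <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  correct <- logical(nrow(data))
  for (i in seq_len(k)) {
    holdout <- folds == i
    # Fit on the other k - 1 folds, then predict the holdout fold
    fit <- glm(formula, data = data[!holdout, ], family = binomial)
    p_hat <- predict(fit, newdata = data[holdout, ], type = "response")
    correct[holdout] <- (p_hat > 0.5) == (data[[response]][holdout] == 1)
  }
  mean(correct)
}

# Simulated data, for illustration only (not the email data)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rbinom(n, 1, 0.3), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 0.5 * toy$x1 + 1.5 * toy$x2))

acc_small <- cv_accuracy(y ~ x1, toy)       # smaller model
acc_large <- cv_accuracy(y ~ x1 + x2, toy)  # larger model
```

Each observation is predicted by a model that did not see it during fitting, so the two accuracies can be compared as estimates of out-of-sample performance.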