Clement lee patch 1 (#110)

clement-lee · web-flow · commit 6150a876770f · 2024-03-12T13:27:10.000Z
* Create wisconsin.R

Copy from personal repo to centralise all materials

* Create wisconsin.Rmd

Similar to wisconsin.R, for centralisation

* Add files via upload

Copy from chapter 4

* Update Chapter_04.md

Fix several minor syntax issues, and change absolute paths to relative

* Create analysis.R

* Update chapter_02.md

To align with chapter 5
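
Note on the `confusion_matrix` chunk in the added scripts: it reads individual cells out of a 2x2 table by position. As a minimal sketch of how such cells map to TN/TP/FP/FN, assuming rows hold the predictions and columns hold the reference (the layout that base R's `table(pred, truth)` produces) and "M" is the positive class. The labels below are made-up toy data, not taken from the Wisconsin dataset:

```r
# Hypothetical toy labels for illustration only.
truth <- factor(c("B", "B", "B", "B", "M", "M", "M", "M"), levels = c("B", "M"))
pred  <- factor(c("B", "B", "B", "M", "M", "M", "B", "B"), levels = c("B", "M"))

# Rows are predictions, columns are the reference.
tab <- table(Prediction = pred, Reference = truth)

# Index by level name rather than by position to avoid mixing up the cells.
TN <- tab["B", "B"]  # predicted B, truly B
TP <- tab["M", "M"]  # predicted M, truly M
FP <- tab["M", "B"]  # predicted M, truly B
FN <- tab["B", "M"]  # predicted B, truly M

sens <- TP / (TP + FN)  # sensitivity for class "M"
spec <- TN / (TN + FP)  # specificity for class "M"
```

Indexing by level name keeps the mapping readable even if the factor levels are reordered later.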
diff --git a/docs/assets/images/Figure1.png b/docs/assets/images/Figure1.png
diff --git a/docs/chapters/Chapter_04.md b/docs/chapters/Chapter_04.md
@@ -52,8 +52,8 @@ install.packages(c(
 ### The Quarto Workflow
 Quarto simplifies document creation through a two-step process:
 
-1. **Code Execution**: Your Quarto document (`.qmd`) is first processed by ![**knitr**](https://yihui.org/knitr/), which executes the code chunks embedded within the document, producing an intermediate Markdown (`.md`) file that includes both the original code and its output.
-2. **Final Rendering**:  The Markdown file is then handed over to ![**pandoc**](https://pandoc.org), which converts it into the final document in the desired output format. This process is highly flexible, supporting a wide array of formats including HTML, PDF, and Word documents.
+1. **Code Execution**: Your Quarto document (`.qmd`) is first processed by [**knitr**](https://yihui.org/knitr/), which executes the code chunks embedded within the document, producing an intermediate Markdown (`.md`) file that includes both the original code and its output.
+2. **Final Rendering**:  The Markdown file is then handed over to [**pandoc**](https://pandoc.org), which converts it into the final document in the desired output format. This process is highly flexible, supporting a wide array of formats including HTML, PDF, and Word documents.
 
 *[Include a visual diagram of the Quarto workflow here]*
 
@@ -77,7 +77,7 @@ cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
 
 ### Creating Your First Quarto Document
 
-![Screenshot of my chart](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure1.png)
+![Screenshot of my chart](../assets/images/Figure1.png)
 
 *Figure A: An example of a Quarto document in RStudio, showcasing integrated code and results.*
 
@@ -120,9 +120,9 @@ This header sets up the document with a title, an author, specifies that code ec
 <summary><strong>📝 Exercise: Produce your own Quarto document</strong></summary>
 <p>
 
-1. Create a new Quarto document in your editor (e.g., RStudio).
-2. Add a title and your name as the author in the YAML header.
-3. Set the output format to HTML.
+1. Create a new Quarto document in your editor (e.g., RStudio).  
+2. Add a title and your name as the author in the YAML header.  
+3. Set the output format to HTML.  
 4. Render the document.
 
 </p>
@@ -132,9 +132,9 @@ This header sets up the document with a title, an author, specifies that code ec
 <summary><strong>✅ Solution:</strong></summary>
 <p>
 
-1. **Create Document**: In RStudio, use `File > New File > Quarto Document`.
-2. **Edit YAML Header**: Add `title: "Your Title"` and `author: "Your Name"`.
-3. **Set Output**: Ensure the YAML includes `format: html`.
+1. **Create Document**: In RStudio, use `File > New File > Quarto Document`.  
+2. **Edit YAML Header**: Add `title: "Your Title"` and `author: "Your Name"`.  
+3. **Set Output**: Ensure the YAML includes `format: html`.  
 4. **Render**: Click the "Render" button to produce your HTML document.
 
 *Insert screenshots demonstrating each step for clarity.*
@@ -449,8 +449,8 @@ format:
 editor: visual
 ```
    
-**Figure 2**: Example of how the Quarto HTML document head looks.
-![Quarto HTML Document Head](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure_CodRep/CodeRep1.png)
+**Figure 2**: Example of how the Quarto HTML document head looks.  
+![Quarto HTML Document Head](./Figure_CodRep/CodeRep1.png)
 
 </p>
 </details>
@@ -484,8 +484,8 @@ knitr::opts_chunk$set(fig.align = "center")
 <summary><strong>Solution</strong></summary>
 <p>
    
-**Figure 3**: Example of importing packages in Quarto.
-![Importing Packages in Quarto](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure_CodRep/CoderRep2.png)
+**Figure 3**: Example of importing packages in Quarto.  
+![Importing Packages in Quarto](./Figure_CodRep/CoderRep2.png)
 
 </p>
 </details>
@@ -577,7 +577,7 @@ ADD SCREENSHOT ON HOW IT LOOK
 <details>
 <summary><strong>Solution: Insert text and code (Step 2)</strong></summary>
 
-Statistical tests
+Statistical tests  
 
 We can perform a two-sample *t*-test to find out whether the mean of a feature differs significantly between the two tumour statuses.
 ```{r ttest, results = "hold"}
@@ -606,6 +606,7 @@ The results are in the following table:
 
 
 TO DO:
+
 1. fix further this part of the exercise, in content but also appearance
 2. add the entire render document at the very end 
 
diff --git a/docs/chapters/chapter_02.md b/docs/chapters/chapter_02.md
@@ -57,7 +57,7 @@ Now, it is common that the analysis changes direction as you go along, and/or th
 If you feel dissatisfied with this workflow, you will benefit from this training programme. You will be able to adopt a more efficient workflow that not only generates a deliverable with reproducible results, but also keeps track of the versions of the files so there won't be anything like `presentation-final-final-02.pdf`.
 
 ## 2.3 Literate Programming
-Practically, literate programming (almost) means merging the **.R** and .tex files in the old workflow. Let's start with a snippet of an R script:
+Practically, literate programming (almost) means merging the **.R** and **.tex** files in the old workflow. Let's start with a snippet of a conventional (non-literate) R script (you can access the full script [here](../scripts/analysis.R)):
 
 ```
 cancer_data <- read.csv("data/breast-cancer-wisconsin.csv") # load the data
diff --git a/docs/scripts/analysis.R b/docs/scripts/analysis.R
@@ -0,0 +1,129 @@
+## ----prelim, message = FALSE--------------------------------------------------
+library(tibble)
+library(dplyr)
+library(ggplot2)
+library(caret)
+library(ROCR)
+library(pROC)
+source("plots.R")
+source("models.R")
+theme_set(theme_bw(12))
+
+
+## ----read---------------------------------------------------------------------
+cancer_data <- as_tibble(read.csv("data/breast-cancer-wisconsin.csv"))
+head(cancer_data)
+cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
+colnames(cancer_data)
+dim(cancer_data)
+
+
+## ----remove_na----------------------------------------------------------------
+cancer_data <- cancer_data |> select(where(~ all(!is.na(.x))))
+head(cancer_data)
+
+
+## ----plot-counts, fig.cap = "Counts of data according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_count(cancer_data)
+
+
+## ----plot-hist, fig.show = "hold", fig.cap = "Histogram and density of three features according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_hist_den(
+  cancer_data, var = area_worst, label = "Area worst", filename = "hist_area.png"
+)
+plot_hist_den(
+  cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
+  filename = "hist_frac.png"
+)
+plot_hist_den(
+  cancer_data, var = radius_se, label = "Radius se", filename = "hist_radius.png"
+)
+
+
+## ----plot-boxplot, fig.cap = "Boxplot of three features according to tumour status", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_boxplot(
+  cancer_data, var = area_worst, label = "Area worst", filename = "boxplot_area.png"
+)
+plot_boxplot(
+  cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
+  filename = "boxplot_frac.png"
+)
+plot_boxplot(
+  cancer_data, var = radius_se, label = "Radius se", filename = "boxplot_radius.png"
+)
+
+
+## ----plot-smoothed, warning = FALSE, message = FALSE, fig.cap = "Tumour status against each feature with fitted logistic regression line", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65----
+plot_logistic_smoothed(
+  cancer_data, var = area_worst, filename = "smoothed_area.png"
+)
+plot_logistic_smoothed(
+  cancer_data, var = fractal_dimension_mean, filename = "smoothed_frac.png"
+)
+plot_logistic_smoothed(
+  cancer_data, var = radius_se, filename = "smoothed_radius.png"
+)
+
+
+## ----ttest, results = "hold"--------------------------------------------------
+area_worst_B <- cancer_data$area_worst[cancer_data$diagnosis == "B"]
+area_worst_M <- cancer_data$area_worst[cancer_data$diagnosis == "M"]
+ttest0 <- t.test(area_worst_B, area_worst_M, var.equal = TRUE)
+options(scipen = 3, digits = 3)
+
+
+## ----ttest_functional---------------------------------------------------------
+ttest1 <- ttest_var(cancer_data, var = area_worst)
+ttest2 <- ttest_var(cancer_data, var = fractal_dimension_mean)
+ttest3 <- ttest_var(cancer_data, var = radius_se)
+
+
+## ----drop, results = "hold"---------------------------------------------------
+input_data <- cancer_data |> select(-id, -diagnosis)
+
+
+## ----plot-corr, fig.cap = "Correlation matrix heatmap", out.width = "80%", fig.align = "center", fig.asp = 0.65, message = FALSE----
+plot_heatmap(input_data, filename = "heatmap.png")
+
+
+## ----drop_corr----------------------------------------------------------------
+correlation_data <- remove_cols_high_cor(input_data)
+names(correlation_data)
+
+
+## ----cross_validation, warning = FALSE----------------------------------------
+list_model <- create_model(cancer_data)
+pred_y <- list_model$pred
+cancer_test <- list_model$test
+
+
+## ----confusion_matrix---------------------------------------------------------
+cm <- confusionMatrix(pred_y, cancer_test$diagnosis, positive = 'M')
+cm
+# cm$table has predictions in rows and the reference in columns (levels B, M)
+TN <- cm$table[1,1]
+TP <- cm$table[2,2]
+FP <- cm$table[2,1]
+FN <- cm$table[1,2]
+
+
+## ----classification_report----------------------------------------------------
+sensitivity(pred_y, cancer_test$diagnosis, positive = "M")
+specificity(pred_y, cancer_test$diagnosis, negative = "B")
+posPredValue(pred_y, cancer_test$diagnosis, positive = "M")
+negPredValue(pred_y, cancer_test$diagnosis, negative = "B")
+precision(pred_y, cancer_test$diagnosis, relevant = "M")
+recall(pred_y, cancer_test$diagnosis, relevant = "M")
+
+
+## ----auc----------------------------------------------------------------------
+pred_y_num <- as.numeric(pred_y)
+test_diagnosis_num <- as.numeric(cancer_test$diagnosis)
+auc_cancer <- auc(test_diagnosis_num, pred_y_num) # pROC::auc(response, predictor)
+
+
+## ----roc, fig.cap = "ROC curve for the logistic regression model. Source: Wisconsin Breast Cancer Dataset.", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_roc(pred_y, cancer_test$diagnosis, filename = "roc.png")
+
+
+## ----session_info-------------------------------------------------------------
+sessionInfo()
diff --git a/docs/scripts/wisconsin.R b/docs/scripts/wisconsin.R
@@ -0,0 +1,129 @@
+## ----prelim, message = FALSE--------------------------------------------------
+library(tibble)
+library(dplyr)
+library(ggplot2)
+library(caret)
+library(ROCR)
+library(pROC)
+source("plots.R")
+source("models.R")
+theme_set(theme_bw(12))
+
+
+## ----read---------------------------------------------------------------------
+cancer_data <- as_tibble(read.csv("data/breast-cancer-wisconsin.csv"))
+head(cancer_data)
+cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
+colnames(cancer_data)
+dim(cancer_data)
+
+
+## ----remove_na----------------------------------------------------------------
+cancer_data <- cancer_data |> select(where(~ all(!is.na(.x))))
+head(cancer_data)
+
+
+## ----plot-counts, fig.cap = "Counts of data according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_count(cancer_data)
+
+
+## ----plot-hist, fig.show = "hold", fig.cap = "Histogram and density of three features according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_hist_den(
+  cancer_data, var = area_worst, label = "Area worst", filename = "hist_area.png"
+)
+plot_hist_den(
+  cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
+  filename = "hist_frac.png"
+)
+plot_hist_den(
+  cancer_data, var = radius_se, label = "Radius se", filename = "hist_radius.png"
+)
+
+
+## ----plot-boxplot, fig.cap = "Boxplot of three features according to tumour status", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_boxplot(
+  cancer_data, var = area_worst, label = "Area worst", filename = "boxplot_area.png"
+)
+plot_boxplot(
+  cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
+  filename = "boxplot_frac.png"
+)
+plot_boxplot(
+  cancer_data, var = radius_se, label = "Radius se", filename = "boxplot_radius.png"
+)
+
+
+## ----plot-smoothed, warning = FALSE, message = FALSE, fig.cap = "Tumour status against each feature with fitted logistic regression line", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65----
+plot_logistic_smoothed(
+  cancer_data, var = area_worst, filename = "smoothed_area.png"
+)
+plot_logistic_smoothed(
+  cancer_data, var = fractal_dimension_mean, filename = "smoothed_frac.png"
+)
+plot_logistic_smoothed(
+  cancer_data, var = radius_se, filename = "smoothed_radius.png"
+)
+
+
+## ----ttest, results = "hold"--------------------------------------------------
+area_worst_B <- cancer_data$area_worst[cancer_data$diagnosis == "B"]
+area_worst_M <- cancer_data$area_worst[cancer_data$diagnosis == "M"]
+ttest0 <- t.test(area_worst_B, area_worst_M, var.equal = TRUE)
+options(scipen = 3, digits = 3)
+
+
+## ----ttest_functional---------------------------------------------------------
+ttest1 <- ttest_var(cancer_data, var = area_worst)
+ttest2 <- ttest_var(cancer_data, var = fractal_dimension_mean)
+ttest3 <- ttest_var(cancer_data, var = radius_se)
+
+
+## ----drop, results = "hold"---------------------------------------------------
+input_data <- cancer_data |> select(-id, -diagnosis)
+
+
+## ----plot-corr, fig.cap = "Correlation matrix heatmap", out.width = "80%", fig.align = "center", fig.asp = 0.65, message = FALSE----
+plot_heatmap(input_data, filename = "heatmap.png")
+
+
+## ----drop_corr----------------------------------------------------------------
+correlation_data <- remove_cols_high_cor(input_data)
+names(correlation_data)
+
+
+## ----cross_validation, warning = FALSE----------------------------------------
+list_model <- create_model(cancer_data)
+pred_y <- list_model$pred
+cancer_test <- list_model$test
+
+
+## ----confusion_matrix---------------------------------------------------------
+cm <- confusionMatrix(pred_y, cancer_test$diagnosis, positive = 'M')
+cm
+# cm$table has predictions in rows and the reference in columns (levels B, M)
+TN <- cm$table[1,1]
+TP <- cm$table[2,2]
+FP <- cm$table[2,1]
+FN <- cm$table[1,2]
+
+
+## ----classification_report----------------------------------------------------
+sensitivity(pred_y, cancer_test$diagnosis, positive = "M")
+specificity(pred_y, cancer_test$diagnosis, negative = "B")
+posPredValue(pred_y, cancer_test$diagnosis, positive = "M")
+negPredValue(pred_y, cancer_test$diagnosis, negative = "B")
+precision(pred_y, cancer_test$diagnosis, relevant = "M")
+recall(pred_y, cancer_test$diagnosis, relevant = "M")
+
+
+## ----auc----------------------------------------------------------------------
+pred_y_num <- as.numeric(pred_y)
+test_diagnosis_num <- as.numeric(cancer_test$diagnosis)
+auc_cancer <- auc(test_diagnosis_num, pred_y_num) # pROC::auc(response, predictor)
+
+
+## ----roc, fig.cap = "ROC curve for the logistic regression model. Source: Wisconsin Breast Cancer Dataset.", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
+plot_roc(pred_y, cancer_test$diagnosis, filename = "roc.png")
+
+
+## ----session_info-------------------------------------------------------------
+sessionInfo()
diff --git a/docs/scripts/wisconsin.Rmd b/docs/scripts/wisconsin.Rmd