
Commit 6150a87

Clement lee patch 1 (#110)
* Create wisconsin.R: copy from personal repo to centralise all materials
* Create wisconsin.Rmd: similar to wisconsin.R, for centralisation
* Add files via upload: copy from chapter 4
* Update Chapter_04.md: fix several minor syntax issues, and change absolute paths to relative
* Create analysis.R
* Update chapter_02.md: to align with chapter 5
1 parent 914055a commit 6150a87

6 files changed

Lines changed: 497 additions & 15 deletions


docs/assets/images/Figure1.png

129 KB

docs/chapters/Chapter_04.md

Lines changed: 15 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -52,8 +52,8 @@ install.packages(c(
5252
### The Quarto Workflow
5353
Quarto simplifies document creation through a two-step process:
5454

55-
1. **Code Execution**: Your Quarto document (`.qmd`) is first processed by ![**knitr**](https://yihui.org/knitr/), which executes the code chunks embedded within the document, producing an intermediate Markdown (`.md`) file that includes both the original code and its output.
56-
2. **Final Rendering**: The Markdown file is then handed over to ![**pandoc**](https://pandoc.org), which converts it into the final document in the desired output format. This process is highly flexible, supporting a wide array of formats including HTML, PDF, and Word documents.
55+
1. **Code Execution**: Your Quarto document (`.qmd`) is first processed by [**knitr**](https://yihui.org/knitr/), which executes the code chunks embedded within the document, producing an intermediate Markdown (`.md`) file that includes both the original code and its output.
56+
2. **Final Rendering**: The Markdown file is then handed over to [**pandoc**](https://pandoc.org), which converts it into the final document in the desired output format. This process is highly flexible, supporting a wide array of formats including HTML, PDF, and Word documents.
5757

5858
*[Include a visual diagram of the Quarto workflow here]*
5959

@@ -77,7 +77,7 @@ cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
7777

7878
### Creating Your First Quarto Document
7979

80-
![Screenshot of my chart](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure1.png)
80+
![Screenshot of my chart](../assets/images/Figure1.png)
8181

8282
*Figure A: An example of a Quarto document in RStudio, showcasing integrated code and results.*
8383

@@ -120,9 +120,9 @@ This header sets up the document with a title, an author, specifies that code ec
120120
<summary><strong>📝 Exercise: Produce your own Quarto document</strong></summary>
121121
<p>
122122

123-
1. Create a new Quarto document in your editor (e.g., RStudio).
124-
2. Add a title and your name as the author in the YAML header.
125-
3. Set the output format to HTML.
123+
1. Create a new Quarto document in your editor (e.g., RStudio).
124+
2. Add a title and your name as the author in the YAML header.
125+
3. Set the output format to HTML.
126126
4. Render the document.
127127

128128
</p>
@@ -132,9 +132,9 @@ This header sets up the document with a title, an author, specifies that code ec
132132
<summary><strong>✅ Solution:</strong></summary>
133133
<p>
134134

135-
1. **Create Document**: In RStudio, use `File > New File > Quarto Document`.
136-
2. **Edit YAML Header**: Add `title: "Your Title"` and `author: "Your Name"`.
137-
3. **Set Output**: Ensure the YAML includes `format: html`.
135+
1. **Create Document**: In RStudio, use `File > New File > Quarto Document`.
136+
2. **Edit YAML Header**: Add `title: "Your Title"` and `author: "Your Name"`.
137+
3. **Set Output**: Ensure the YAML includes `format: html`.
138138
4. **Render**: Click the "Render" button to produce your HTML document.
139139

140140
*Insert screenshots demonstrating each step for clarity.*
@@ -449,8 +449,8 @@ format:
449449
editor: visual
450450
```
451451
452-
**Figure 2**: Example of how the Quarto HTML document head looks.
453-
![Quarto HTML Document Head](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure_CodRep/CodeRep1.png)
452+
**Figure 2**: Example of how the Quarto HTML document head looks.
453+
![Quarto HTML Document Head](./Figure_CodRep/CodeRep1.png)
454454
455455
</p>
456456
</details>
@@ -484,8 +484,8 @@ knitr::opts_chunk$set(fig.align = "center")
484484
<summary><strong>Solution</strong></summary>
485485
<p>
486486

487-
**Figure 3**: Example of importing packages in Quarto.
488-
![Importing Packages in Quarto](https://github.com/elixir-europe-training/ELIXIR-TrP-LiterateProgrammingR-CodeRep/blob/main/docs/chapters/Figure_CodRep/CoderRep2.png)
487+
**Figure 3**: Example of importing packages in Quarto.
488+
![Importing Packages in Quarto](./Figure_CodRep/CoderRep2.png)
489489

490490
</p>
491491
</details>
@@ -577,7 +577,7 @@ ADD SCREENSHOT ON HOW IT LOOK
577577
<details>
578578
<summary><strong>Solution: Insert text and code (Step 2)</strong></summary>
579579

580-
Statistical tests
580+
Statistical tests
581581

582582
We can perform a two-sample *t*-test to find out whether there is a significant difference in the distribution of a feature according to the tumour status.
583583
```{r ttest, results = "hold"}
@@ -606,6 +606,7 @@ The results are in the following table:
606606

607607

608608
TO DO:
609+
609610
1. further fix this part of the exercise, in both content and appearance
610611
2. add the entire render document at the very end
611612

docs/chapters/chapter_02.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ Now, it is common that the analysis changes direction as you go along, and/or th
5757
If you feel dissatisfied with this workflow, you will benefit from this training programme. You will be able to adopt a more efficient workflow that not only generates a deliverable with reproducible results, but also keeps track of the versions of the files so there won't be anything like `presentation-final-final-02.pdf`.
5858

5959
## 2.3 Literate Programming
60-
Practically, literate programming (almost) means merging the **.R** and .tex files in the old workflow. Let's start with a snippet of an R script:
60+
Practically, literate programming (almost) means merging the **.R** and .tex files in the old workflow. Let's start with a snippet of a non-literate-programming R script (you can access the full script [here](../scripts/analysis.R)):
6161

6262
```
6363
cancer_data <- read.csv("data/breast-cancer-wisconsin.csv") # load the data

docs/scripts/analysis.R

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
## ----prelim, message = FALSE--------------------------------------------------
2+
library(tibble)
3+
library(dplyr)
4+
library(ggplot2)
5+
library(caret)
6+
library(ROCR)
7+
library(pROC)
8+
source("plots.R")
9+
source("models.R")
10+
theme_set(theme_bw(12))
11+
12+
13+
## ----read---------------------------------------------------------------------
14+
cancer_data <- as_tibble(read.csv("data/breast-cancer-wisconsin.csv"))
15+
head(cancer_data)
16+
cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
17+
colnames(cancer_data)
18+
dim(cancer_data)
19+
20+
21+
## ----remove_na----------------------------------------------------------------
22+
cancer_data <- cancer_data |> select(where(~ all(!is.na(.x))))
23+
head(cancer_data)
24+
25+
26+
## ----plot-counts, fig.cap = "Counts of data according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
27+
plot_count(cancer_data)
28+
29+
30+
## ----plot-hist, fig.show = "hold", fig.cap = "Histogram and density of three features according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
31+
plot_hist_den(
32+
cancer_data, var = area_worst, label = "Area worst", filename = "hist_area.png"
33+
)
34+
plot_hist_den(
35+
cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
36+
filename = "hist_frac.png"
37+
)
38+
plot_hist_den(
39+
cancer_data, var = radius_se, label = "Radius se", filename = "hist_radius.png"
40+
)
41+
42+
43+
## ----plot-boxplot, fig.cap = "Boxplot of three features according to tumour status", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
44+
plot_boxplot(
45+
cancer_data, var = area_worst, label = "Area worst", filename = "boxplot_area.png"
46+
)
47+
plot_boxplot(
48+
cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
49+
filename = "boxplot_frac.png"
50+
)
51+
plot_boxplot(
52+
cancer_data, var = radius_se, label = "Radius se", filename = "boxplot_radius.png"
53+
)
54+
55+
56+
## ----plot-smoothed, warning = FALSE, message = FALSE, fig.cap = "Tumour status against each feature with fitted logistic regression line", fig.show = "hold", fig.pos = "!h", fig.asp = 0.5, fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
57+
plot_logistic_smoothed(
58+
cancer_data, var = area_worst, filename = "smoothed_area.png"
59+
)
60+
plot_logistic_smoothed(
61+
cancer_data, var = fractal_dimension_mean, filename = "smoothed_frac.png"
62+
)
63+
plot_logistic_smoothed(
64+
cancer_data, var = radius_se, filename = "smoothed_radius.png"
65+
)
66+
67+
68+
## ----ttest, results = "hold"--------------------------------------------------
69+
area_worst_B <- cancer_data$area_worst[cancer_data$diagnosis == "B"]
70+
area_worst_M <- cancer_data$area_worst[cancer_data$diagnosis == "M"]
71+
ttest0 <- t.test(area_worst_B, area_worst_M, var.equal = TRUE)
72+
options(scipen = 3, digits = 3)
73+
74+
75+
## ----ttest_functional---------------------------------------------------------
76+
ttest1 <- ttest_var(cancer_data, var = area_worst)
77+
ttest2 <- ttest_var(cancer_data, var = fractal_dimension_mean)
78+
ttest3 <- ttest_var(cancer_data, var = radius_se)
79+
80+
81+
## ----drop, results = "hold"---------------------------------------------------
82+
input_data <- cancer_data |> select(-id, -diagnosis)
83+
84+
85+
## ----plot-corr, fig.cap = "Correlation matrix heatmap", out.width = "80%", fig.align = "center", fig.asp = 0.65, message = FALSE----
86+
plot_heatmap(input_data, filename = "heatmap.png")
87+
88+
89+
## ----drop_corr----------------------------------------------------------------
90+
correlation_data <- remove_cols_high_cor(input_data)
91+
names(correlation_data)
92+
93+
94+
## ----cross_validation, warning = FALSE----------------------------------------
95+
list_model <- create_model(cancer_data)
96+
pred_y <- list_model$pred
97+
cancer_test <- list_model$test
98+
99+
100+
## ----confusion_matrix---------------------------------------------------------
101+
cm <- confusionMatrix(pred_y, cancer_test$diagnosis, positive = 'M')
102+
cm
103+
TN <- cm$table[1,1]
104+
TP <- cm$table[2,2]
105+
FP <- cm$table[2,1]  # row 2 = predicted M, column 1 = reference B
106+
FN <- cm$table[1,2]  # row 1 = predicted B, column 2 = reference M
107+
108+
109+
## ----classification_report----------------------------------------------------
110+
sensitivity(pred_y, cancer_test$diagnosis, positive = "M")
111+
specificity(pred_y, cancer_test$diagnosis, positive = "M")
112+
posPredValue(pred_y, cancer_test$diagnosis, positive = "M")
113+
negPredValue(pred_y, cancer_test$diagnosis, positive = "M")
114+
precision(pred_y, cancer_test$diagnosis, positive = "M")
115+
recall(pred_y, cancer_test$diagnosis, positive = "M")
116+
117+
118+
## ----auc----------------------------------------------------------------------
119+
pred_y_num <- as.numeric(pred_y)
120+
test_diagnosis_num <- as.numeric(cancer_test$diagnosis)
121+
auc_cancer <- auc(test_diagnosis_num, pred_y_num)  # pROC::auc expects (response, predictor)
122+
123+
124+
## ----roc, fig.cap = "ROC curve for the logistic regression model. Source: Wisconsin Breast Cancer Dataset.", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
125+
plot_roc(pred_y, cancer_test$diagnosis, filename = "roc.png")
126+
127+
128+
## ----session_info-------------------------------------------------------------
129+
sessionInfo()
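
The `ttest` chunk in this script splits one feature by diagnosis and runs an equal-variance two-sample *t*-test; `ttest_var()` is presumably defined in the sourced `models.R`, which is not part of this diff. The following is a minimal sketch of the same pattern on simulated data. The data frame `toy` and the helper `ttest_by_name()` are hypothetical names introduced only for illustration, and unlike `ttest_var()` the helper takes the column name as a string.

```r
set.seed(1)

# Simulated stand-in for the cancer data (NOT the Wisconsin dataset).
toy <- data.frame(
  diagnosis  = factor(rep(c("B", "M"), each = 30)),
  area_worst = c(rnorm(30, mean = 550, sd = 150),
                 rnorm(30, mean = 1400, sd = 500))
)

# Hypothetical helper: two-sample t-test of one column by diagnosis,
# assuming equal variances as in the `ttest` chunk.
ttest_by_name <- function(data, var) {
  x <- data[[var]][data$diagnosis == "B"]
  y <- data[[var]][data$diagnosis == "M"]
  t.test(x, y, var.equal = TRUE)
}

res <- ttest_by_name(toy, "area_worst")
res$statistic  # t statistic
res$parameter  # degrees of freedom: 30 + 30 - 2
res$p.value
```

Wrapping the test in a function avoids repeating the subsetting code for every feature, which is the motivation for the `ttest_functional` chunk.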

docs/scripts/wisconsin.R

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
## ----prelim, message = FALSE--------------------------------------------------
2+
library(tibble)
3+
library(dplyr)
4+
library(ggplot2)
5+
library(caret)
6+
library(ROCR)
7+
library(pROC)
8+
source("plots.R")
9+
source("models.R")
10+
theme_set(theme_bw(12))
11+
12+
13+
## ----read---------------------------------------------------------------------
14+
cancer_data <- as_tibble(read.csv("data/breast-cancer-wisconsin.csv"))
15+
head(cancer_data)
16+
cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
17+
colnames(cancer_data)
18+
dim(cancer_data)
19+
20+
21+
## ----remove_na----------------------------------------------------------------
22+
cancer_data <- cancer_data |> select(where(~ all(!is.na(.x))))
23+
head(cancer_data)
24+
25+
26+
## ----plot-counts, fig.cap = "Counts of data according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
27+
plot_count(cancer_data)
28+
29+
30+
## ----plot-hist, fig.show = "hold", fig.cap = "Histogram and density of three features according to tumour status", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
31+
plot_hist_den(
32+
cancer_data, var = area_worst, label = "Area worst", filename = "hist_area.png"
33+
)
34+
plot_hist_den(
35+
cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
36+
filename = "hist_frac.png"
37+
)
38+
plot_hist_den(
39+
cancer_data, var = radius_se, label = "Radius se", filename = "hist_radius.png"
40+
)
41+
42+
43+
## ----plot-boxplot, fig.cap = "Boxplot of three features according to tumour status", fig.show = "hold", fig.pos = "!h", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
44+
plot_boxplot(
45+
cancer_data, var = area_worst, label = "Area worst", filename = "boxplot_area.png"
46+
)
47+
plot_boxplot(
48+
cancer_data, var = fractal_dimension_mean, label = "Fractal dimension mean",
49+
filename = "boxplot_frac.png"
50+
)
51+
plot_boxplot(
52+
cancer_data, var = radius_se, label = "Radius se", filename = "boxplot_radius.png"
53+
)
54+
55+
56+
## ----plot-smoothed, warning = FALSE, message = FALSE, fig.cap = "Tumour status against each feature with fitted logistic regression line", fig.show = "hold", fig.pos = "!h", fig.asp = 0.5, fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
57+
plot_logistic_smoothed(
58+
cancer_data, var = area_worst, filename = "smoothed_area.png"
59+
)
60+
plot_logistic_smoothed(
61+
cancer_data, var = fractal_dimension_mean, filename = "smoothed_frac.png"
62+
)
63+
plot_logistic_smoothed(
64+
cancer_data, var = radius_se, filename = "smoothed_radius.png"
65+
)
66+
67+
68+
## ----ttest, results = "hold"--------------------------------------------------
69+
area_worst_B <- cancer_data$area_worst[cancer_data$diagnosis == "B"]
70+
area_worst_M <- cancer_data$area_worst[cancer_data$diagnosis == "M"]
71+
ttest0 <- t.test(area_worst_B, area_worst_M, var.equal = TRUE)
72+
options(scipen = 3, digits = 3)
73+
74+
75+
## ----ttest_functional---------------------------------------------------------
76+
ttest1 <- ttest_var(cancer_data, var = area_worst)
77+
ttest2 <- ttest_var(cancer_data, var = fractal_dimension_mean)
78+
ttest3 <- ttest_var(cancer_data, var = radius_se)
79+
80+
81+
## ----drop, results = "hold"---------------------------------------------------
82+
input_data <- cancer_data |> select(-id, -diagnosis)
83+
84+
85+
## ----plot-corr, fig.cap = "Correlation matrix heatmap", out.width = "80%", fig.align = "center", fig.asp = 0.65, message = FALSE----
86+
plot_heatmap(input_data, filename = "heatmap.png")
87+
88+
89+
## ----drop_corr----------------------------------------------------------------
90+
correlation_data <- remove_cols_high_cor(input_data)
91+
names(correlation_data)
92+
93+
94+
## ----cross_validation, warning = FALSE----------------------------------------
95+
list_model <- create_model(cancer_data)
96+
pred_y <- list_model$pred
97+
cancer_test <- list_model$test
98+
99+
100+
## ----confusion_matrix---------------------------------------------------------
101+
cm <- confusionMatrix(pred_y, cancer_test$diagnosis, positive = 'M')
102+
cm
103+
TN <- cm$table[1,1]
104+
TP <- cm$table[2,2]
105+
FP <- cm$table[2,1]  # row 2 = predicted M, column 1 = reference B
106+
FN <- cm$table[1,2]  # row 1 = predicted B, column 2 = reference M
107+
108+
109+
## ----classification_report----------------------------------------------------
110+
sensitivity(pred_y, cancer_test$diagnosis, positive = "M")
111+
specificity(pred_y, cancer_test$diagnosis, positive = "M")
112+
posPredValue(pred_y, cancer_test$diagnosis, positive = "M")
113+
negPredValue(pred_y, cancer_test$diagnosis, positive = "M")
114+
precision(pred_y, cancer_test$diagnosis, positive = "M")
115+
recall(pred_y, cancer_test$diagnosis, positive = "M")
116+
117+
118+
## ----auc----------------------------------------------------------------------
119+
pred_y_num <- as.numeric(pred_y)
120+
test_diagnosis_num <- as.numeric(cancer_test$diagnosis)
121+
auc_cancer <- auc(test_diagnosis_num, pred_y_num)  # pROC::auc expects (response, predictor)
122+
123+
124+
## ----roc, fig.cap = "ROC curve for the logistic regression model. Source: Wisconsin Breast Cancer Dataset.", fig.align = "center", out.width = "70%", fig.asp = 0.65, message = FALSE----
125+
plot_roc(pred_y, cancer_test$diagnosis, filename = "roc.png")
126+
127+
128+
## ----session_info-------------------------------------------------------------
129+
sessionInfo()
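
The `confusion_matrix` chunk in this script indexes `cm$table` by position, so it is worth checking which cell is which. In caret's `confusionMatrix()` table, rows are predictions and columns are the reference. The following is a minimal sketch of that layout using only base R's `table()` and made-up labels (not the Wisconsin data), with `"M"` as the positive class.

```r
# Toy labels, hypothetical and chosen only to make the counts easy to verify.
truth <- factor(c("B", "B", "B", "M", "M", "B", "M", "M"), levels = c("B", "M"))
pred  <- factor(c("B", "M", "B", "M", "B", "B", "M", "M"), levels = c("B", "M"))

# Rows = prediction, columns = reference (same orientation as caret).
tab <- table(Prediction = pred, Reference = truth)
print(tab)

# Index by name rather than by position to avoid mixing up the cells.
TN <- tab["B", "B"]  # predicted B, truly B
TP <- tab["M", "M"]  # predicted M, truly M
FP <- tab["M", "B"]  # predicted M, truly B  (position [2, 1])
FN <- tab["B", "M"]  # predicted B, truly M  (position [1, 2])

sens <- TP / (TP + FN)  # sensitivity (recall) for class M
spec <- TN / (TN + FP)  # specificity for class M
```

Indexing by level name keeps the code correct even if the factor levels are reordered.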
