Skip to content

Commit 570b810

Browse files
committed
fixed some bugs with the column order and revised instructions
1 parent 85ca71d commit 570b810

3 files changed

Lines changed: 526 additions & 228 deletions

File tree

stat-R_2020/evaluation-m3-2020/NOM-PRENOM_evaluation-m3-2020.Rmd

Lines changed: 158 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ requiredLib <- c(
3535
"e1071",
3636
"rpart.plot",
3737
"randomForest",
38+
"grDevices", ## For densCols
3839
# "corrplot",
3940
"FactoMineR",
4041
"e1071",
@@ -65,6 +66,14 @@ options(scipen = 12) ## Max number of digits for non-scientific notation
6566
6667
```
6768

69+
```{r load_solutions}
70+
#### Just for the trainer: if the solution file exists, load it ####
71+
solutionFile <- "van-Helden-Jacques_evlaluation-m3_solutions.R"
72+
if (file.exists(solutionFile)) {
73+
read_chunk(solutionFile)
74+
}
75+
76+
```
6877

6978

7079
## Principle of the evaluation
@@ -89,13 +98,13 @@ You will then use your model to assign each sample to one of the 4 cancer types:
8998

9099
- **Reusability of the code**: the trainers should be able to run your R markdown on their instance of RStudio
91100

92-
- **Clarity of the code**: the code should be understandable by someone familiar with R programming. Variable names should indicate their content.
101+
- **Clarity of the code**: the code should be understandable for someone familiar with R programming. Variable names should indicate their content.
93102

94-
- **Code structuration**: the code should be well-structured. If you need to run several times the same code, try to encapsulate it in a function rather than duplicating pieces of code.
103+
- **Code structuration**: the code should be well structured. For example, if a given piece of code has to run at several places in your script, try to encapsulate it in a function rather than duplicating the code.
95104

96105
- **Code documentation**: the code should be documented by explaining what each piece of code aims at. The implementation choices should be documented.
97106

98-
- **Relevance of the statistical approaches**: the markdown should use the appropriate methods to address the biological questions. Don't hesitate to comment the basic assumptions underlying the different methods, and the adequacy of your data to these assumption.
107+
- **Relevance of the statistical approaches**: the markdown should explain why a given statistical method is appropriate to the addressed biological question. Don't hesitate to comment the basic assumptions underlying the different methods, and the adequacy of your data to these assumption.
99108

100109
- **Interpretation of the results**: for each task, we will ask you to interpret the results and to highlight their biological relevance.
101110

@@ -120,27 +129,31 @@ Your work will include the following tasks.
120129

121130
3. **Supervised classification**
122131

123-
- choose a supervised classification method among KNN, SVM, Random Forest (or any other method if you feel adventurous).
132+
- choose a supervised classification method among KNN, SVM, Random Forest, or any other method if you feel adventurous.
124133

125134

126135
## Specification of the working directories
127136

128-
We propose hereafter a piece of code to instantiate the working directories for this analysis in your account. In principle this code should work on any Unix-like operating system (Linux, Mac OX X)
137+
We propose hereafter a piece of code to instantiate the working directories for this analysis in your account. In principle this code should work on any Unix-like operating system (Linux, Mac OX X), it might require some slight adaptation for Windows operating systems.
129138

130139
```{r directories}
131140
## Create a vector with all the user-specific directories, which can be exported in the report afterwards
132141
dir <- vector()
133142
143+
## Specify your home directory
144+
dir["home"] <- "~" # This should probably be modified for Windows OS.
145+
134146
## Specify the local directory for the personal work
135-
dir["base"] <- "~/DUBii/m3-stat-R/personal-work" # Adapt to yours
147+
## Don't hesitate to modify this according to your own file organisation
148+
dir["base"] <- file.path(dir["home"], "DUBii", "m3-stat-R", "personal-work")
136149
137150
## Directory with the results of all analyses
138151
dir["results"] <- file.path(dir["base"], "results")
139152
dir.create(dir["results"], showWarnings = FALSE, recursive = TRUE)
140153
141154
## Specify a local directory to download the data
142155
dir["BICdata"] <- file.path(dir["base"], "data", "BIC")
143-
message("Creating local data directory for BIC data", dir["BICdata"])
156+
message("Local data directory for BIC data\t", dir["BICdata"])
144157
dir.create(dir["BICdata"], showWarnings = FALSE, recursive = TRUE)
145158
146159
## Print out a table with the working directories
@@ -152,11 +165,16 @@ kable(data.frame(dir), col.names = "Directory", caption = "Directories")
152165

153166
## Data set
154167

155-
The goal of this work is to develop statistical models to predict cancer types using data from the TCGA study (*The Cancer Genome Atlas*; https://cancergenome.nih.gov/) which includes RNA-seq data from breast cancer patients (Breast Invasice Cancer or BIC). There are two invasive cancer types involved : ductal and lobular carcinomas. Original papers for these studies are here: https://www.nature.com/articles/nature11412 et https://www.cell.com/cell/fulltext/S0092-8674(15)01195-2
168+
In this work we will develop statistical models to predict cancer types using data from the TCGA study (*The Cancer Genome Atlas*; https://cancergenome.nih.gov/) which includes RNA-seq data from breast cancer patients (**Breast Invasice Cancer** or **BIC**). There are two invasive cancer types involved: ductal and lobular carcinomas.
156169

157-
You will use the following data :
170+
Original papers for these studies are here:
158171

159-
* file `BIC_log2-norm-counts_edgeR_DEG_top_1000.tsv.gz` corresponds to expressions for the 1000 most significant genes (lines) for 819 samples (columns).
172+
- <https://www.nature.com/articles/nature11412>
173+
- <https://www.cell.com/cell/fulltext/S0092-8674(15)01195-2>
174+
175+
We prepared the files containing the BIC transcriptome profiles, and selected a subset of 1000 genes to reduce the size of the data files.
176+
177+
* file `BIC_log2-norm-counts_edgeR_DEG_top_1000.tsv.gz` corresponds to the RNA quantification in 819 samples (columns), for the 1000 most significant genes (lines) returned by differential analysis.
160178
* file `BIC_sample-classes.tsv.gz` contains the tags of the 819 samples.
161179
* file `BIC_edgeR_DEG_table.tsv.gz` contains the edgeR results of differential expression analysis.
162180

@@ -178,7 +196,7 @@ dataFiles <- c(
178196
for (dataFile in dataFiles) {
179197
localFile <- file.path(dir["BICdata"], dataFile)
180198
if (file.exists(localFile)) {
181-
message("File already there, not downloading ", localFile)
199+
message("File already here, not downloading\n\t", localFile, "\n")
182200
} else {
183201
githubURL <- file.path("https://github.com/DU-Bii/module-3-Stat-R/raw/master/stat-R_2020/data/TCGA_BIC_subset/", dataFile)
184202
message("Downloading data file from github:\n\t", githubURL)
@@ -193,55 +211,95 @@ for (dataFile in dataFiles) {
193211

194212
## Data preprocessing
195213

196-
The data was pre-processed in order to provide you with a dataset ready-to-use for the multivariate analyses tasks.
214+
The data was pre-processed in order to provide you with a dataset ready-to-use for the multivariate analysis tasks.
197215

198216
In short, the pre-processing included the following steps.
199217

200218
1. Download of the raw counts per gene from the ReCount database
219+
201220
2. Selection of the subset of samples belonging to the **Breast Invasive Cancer** (**BIC**) study.
202-
3. Assignment of a cancer type based on three immuno markers.
221+
222+
3. Assignment of a cancer type, based on three immuno markers (documented in the sample description table).
223+
203224
4. Filtering out of the "undetected" genes, i.e. genes having either zero counts in more than 95% of the samples, or a mean count $\se 10$ across all the BIC samples.
225+
204226
5. Count standardisation. To compensate for difference in sequencing depth, we applied a sample-wise standardisation with `DESeq2::estimateSizeFactors()`.
227+
205228
6. Log2 transformation. Standardised counts were log2-transformed in order to normalise the values. Log2-transformed data are more appropriate for clustering and supervised classification.
206229

207-
7. Feature selection. In order to select relevant subset of genes, we ran multi-group differential analysis with edgeR. Note that this analysis was led with the raw counts (edgeR and DESeq2 have their own built-in normalisation procedure, and should never by fed with normalised data).
230+
7. Feature selection. In order to select relevant subset of genes, we ran multi-group differential analysis with edgeR. Note that this analysis was led with the raw counts (edgeR and DESeq2 have their own built-in normalisation procedure, and should never be fed with normalised data).
208231

209232
Additional details and the full code used for preprocessing can be found on the DUBii study case repository: <https://du-bii.github.io/study-cases/Homo_sapiens/TCGA_study-case/import_TCGA_from_Recount.html>
210233

211-
### Multidimensional scaling
212-
213-
We propose three methods to reduce the data dimensions.
214-
215-
a. **Differentially Expressed Genes** (**DEG**). Selection of the most signficant genes reported by DESeq2 (multi-group differential analysis)
216-
b. **Variance-ordered**: genes were sorted by decreasing variance (unsupervised criterion)
217-
c. **PCs** Principal components.
218-
219-
You will use these respective datasets at different stages of your report.
220234

221235

222236
## Data loading
223237

224-
225-
226238
The data were loaded from the folder `r dir["BICdata"]`.
227239

228-
229240
```{r data_loading}
230241
## Load expression data
231-
BIC.expr <- read.table(file = file.path(dir["BICdata"], dataFiles["expression"]), header = TRUE)
242+
## Note: we use the option check.name=FALSE to avoid converting
243+
## hyphens to dot in the sample IDs (column names)
244+
BIC.expr <- read.table(
245+
file = file.path(dir["BICdata"], dataFiles["expression"]),
246+
check.names = FALSE, # Added by JvH 2020-07-24
247+
header = TRUE)
232248
# dim(BIC.expr)
233-
BIC.sample.classes <- read.table(file.path(dir["BICdata"], dataFiles["sample classes"]),header = TRUE)
234-
BIC.deg <- read.table(file.path(dir["BICdata"], dataFiles["DEG table"]),header = TRUE)
249+
# colnames(BIC.expr)
250+
251+
252+
## Load sample description table, whichb includes cancer class + immuno markers
253+
## Note: we set the option stringsAsFactor=FALSE to facilitate
254+
## subsequent use of the cancer.type column.
255+
BIC.sample.classes <- read.table(
256+
file.path(dir["BICdata"], dataFiles["sample classes"]),
257+
stringsAsFactors = FALSE, # Added by JvH 2020-07-24
258+
header = TRUE)
235259
# dim(BIC.sample.classes)
236260
# names(BIC.sample.classes)
261+
# head(BIC.sample.classes)
262+
# class(BIC.sample.classes$cancer.type)
263+
264+
## Define sample colors and 1-letter symbols
265+
class.symbols <- c(
266+
"Basal.like" = "B",
267+
"HER2pos" = "H",
268+
"Luminal.A" = "A",
269+
"Luminal.B" = "B",
270+
"Unclassified" = "U"
271+
)
272+
BIC.sample.classes$symbol <- class.symbols[BIC.sample.classes$cancer.type]
273+
table(BIC.sample.classes$symbol)
274+
275+
## Define a color for each cancer class and assign colors to samples accordingly.
276+
class.colors <- c(
277+
"Basal.like" = "brown",
278+
"HER2pos" = "darkgreen",
279+
"Luminal.A" = "blue",
280+
"Luminal.B" = "violet",
281+
"Unclassified" = "grey"
282+
)
283+
BIC.sample.classes$color <- class.colors[BIC.sample.classes$cancer.type]
284+
285+
286+
## Load differential expression analysis results from DESeq2
287+
BIC.deg <- read.table(file.path(dir["BICdata"], dataFiles["DEG table"]),
288+
header = TRUE)
289+
# head(BIC.deg)
237290
291+
#### Reorder samples in order to group them by class ####
292+
sampleNamesByClass <- rownames(BIC.sample.classes)[order(BIC.sample.classes$cancer.type)]
293+
length(sampleNamesByClass)
294+
BIC.expr <- BIC.expr[, sampleNamesByClass]
295+
BIC.sample.classes <- BIC.sample.classes[sampleNamesByClass, ]
238296
239297
kable(dataFiles, col.names = "File", caption = "Data files")
240298
241299
```
242300

243301

244-
The expression file contains `r nrow(BIC.expr)` rows (genes) x `r ncol(BIC.sample.classes)` columns (samples).
302+
The expression file contains `r nrow(BIC.expr)` rows (genes) x `r ncol(BIC.expr)` columns (samples).
245303

246304
The sample description table contains `r nrow(BIC.sample.classes)` rows (samples) x `r ncol(BIC.sample.classes)` columns (description fields). The first column indicates the cancer type, and the three following one indicate the values (positive/negative) for three marker genes used as diagnostic markers for the cancer type (*ER1*, *PR1* et *Her2*).
247305

@@ -253,13 +311,15 @@ kable(head(BIC.sample.classes, n = 10), caption = "First rows of the sample desc
253311
We can count the number of samples per cancer type
254312

255313
```{r class_summary}
256-
kable(sort(table(BIC.sample.classes$cancer.type), decreasing = TRUE), col.names = c("Cancer type" , "Nb samples"))
314+
kable(sort(table(BIC.sample.classes$cancer.type), decreasing = TRUE),
315+
col.names = c("Cancer type" , "Nb samples"))
257316
```
258317

259318

260-
Comme vous pouvez le voir, il y a 5 types de cancer, dont un "Unclassified". Lors de la prédiction du type de cancer, ce type risque de biaiser les résultats. Nous allons donc temporairement supprimer ces échantillons des jeux de données, mais nous reviendrons dessus à la fin du TP, car ils constituent une excellente application de l'apprentissage automatique. En effet, si un modèle (classifieur entraîné) s'avère donner de bons résultats lors de l'évaluation (training / testing), nous pourrons l'utiliser pour prédire le type de cancer de ces échantillons de type "Unclassified".
319+
There are `r length(unique(BIC.sample.classes$cancer.type))` cancer types, including a "Unclassified", for the samples having a combination of markers inconsistent with the defined subtypes. This might bias the result since this "class" is likely to contain a mixture of different cancer types. We will thus temporarily suppress these samples from the dataset, but we will come back to it at the end of the work, since they provide an excellent opportunity to run the classifier for predictive purpuses (assign unclassified samples to one of the training classes).
261320

262-
We suppress the `Unclassified` samples from the expression and sample description tables.
321+
322+
The code below creates a data set from which the `Unclassified` samples has been suppressed.
263323

264324

265325
```{r remove_unclassified_samples}
@@ -287,10 +347,26 @@ BIC.expr.filtered <- BIC.expr[, -ind.uncl]
287347

288348
## Data reduction
289349

290-
1. Sort the expression table by increasing values of the FDR (padj) column found in the differential expression table (variable `BIC.deg`). Note that DEG table contains `r nrow(BIC.deg)` genes, whereas the expression table was restricted to `r nrow(BIC.expr.filtered)` genes. You will thus have to use a trick to select the right genes and sort them. The result should be a table with `r nrow(BIC.expr.filtered)` rows (genes) and `r ncol(BIC.expr.filtered)` columns (samples), sorted by increasing FDR (not in the table).
350+
351+
We propose three methods to reduce the data dimensions.
352+
353+
a. **Differentially Expressed Genes** (**DEG**). Selection of the most signficant genes reported by DESeq2 (multi-group differential analysis). This has already been done for you, we provide the table with the 1000 top significant genes.
354+
355+
b. **Variance-ordered**: genes were sorted by decreasing variance (unsupervised criterion). To avoid handling big files, we will apply a re-ordering of the 1000 top-ranking DEG genes, and see whether the genes wiht the highest variance provide good classifiers.
356+
357+
c. **PCs** Principal components. We will use a restricted number of principal components and see if they capture a sufficient information to train a classifier.
358+
359+
You will use these respective datasets at different stages of your report.
360+
361+
362+
### DEG selection
363+
364+
The code below sorts the expression table by increasing values of the FDR (column `padj` of the table `BIC.deg`) found in the differential expression table (variable `BIC.deg`).
365+
366+
Note that DEG table contains `r nrow(BIC.deg)` genes, whereas the expression table was restricted to `r nrow(BIC.expr.filtered)` genes. We thus used a trick to select the right genes and sort them. The result is a table with `r nrow(BIC.expr.filtered)` rows (genes) and `r ncol(BIC.expr.filtered)` columns (samples), sorted by increasing FDR (not in the table).
291367

292368
```{r top_deg}
293-
## Select in hte DEG table (contianing n20,000 genes) the subset that is found in the expression table (1000 genes)
369+
## Select in the DEG table (that contains ~20,000 genes) the subset that is found in the expression table (1000 genes)
294370
BIC.deg.filtered <- subset(
295371
BIC.deg,
296372
row.names(BIC.deg) %in% row.names(BIC.expr.filtered))
@@ -307,19 +383,34 @@ BIC.expr.DEGsorted <- BIC.expr.filtered[geneOrder, ]
307383
308384
```
309385

310-
2. Create another table with the filtered expression table sorted by decreasing variance.
386+
### Variance-based ordering
387+
388+
It is now your turn to process the data. Create another table with the filtered expression table sorted by decreasing variance.
389+
390+
```{r variance_ordering, eval=TRUE}
391+
392+
```
393+
311394

312-
3. Run principal component analysis on the filtered expression table and create a separate table named `BIC.expr.PCs` with the coordinates of each sample in the principal component space.
395+
### Principal components
313396

314-
## Principal component analysis
315397

316-
Generate PC plots wiht the following components. Label the samples according to their annotated class.
398+
Run principal component analysis on the filtered expression table and create a separate table named `BIC.expr.PCs` with the coordinates of each sample in the principal component space.
399+
400+
```{r pc_computation}
401+
402+
```
403+
404+
405+
## PC plots
406+
407+
Generate PC plots with the following components. Label the samples according to their annotated class.
317408

318409
- PC2 vs PC1
319410
- PC4 vs PC3
320411
- PC6 vs PC5
321412

322-
```{r pca}
413+
```{r pc_plots}
323414
324415
## YOUR CODE SHOULD COME HERE
325416
@@ -384,19 +475,37 @@ Is there a good correspondence between sample clusters and annotated cancer typ
384475

385476
* The **testing set** is the set which will allow to estimate the unbiased performances of the models. It will be made of the remaining 1/3 of the samples.
386477

387-
1. Split the filtered data set into training (2/3) and a testing (1/3) subsets with a balanced representation of the cancer types (stratified subsampling).
478+
Split the filtered data set into training (2/3) and a testing (1/3) subsets with a balanced representation of the cancer types (stratified subsampling).
479+
480+
```{r train_test_sets}
388481
482+
```
389483

390484
### Evaluation of the performances
391485

392486
In this section, we will perform here a manual evaluation of the classifier in order to make sure we master the different steps. In the next section we will use the `tune()` utilities in order to identify the optimal parameter values.
393487

394488
1. Use the training subset to train a classifier of your choice (KNN, SVM, Random Forest or an other one if you feel adventurous).
395489

490+
```{r training}
491+
492+
```
493+
494+
396495
2. Use the resulting model to predict the cancer type for the samples of the testing set.
397496

497+
```{r testing}
498+
499+
```
500+
501+
398502
3. Build a confusion matrix to compare the predicted classes and annotated classes for the testing set.
399503

504+
```{r confusion_matrix}
505+
506+
```
507+
508+
400509
4. Compute the misclassification error rate (MER).
401510

402511

@@ -412,13 +521,13 @@ In this section, we will perform here a manual evaluation of the classifier in o
412521
Interpret the results: how well does the classifier perform? Is there a bias towards some cancer types?
413522

414523

415-
### Tuning of the parametes
524+
### Tuning of the parameters
416525

417526
Test the impact of a parameter on the performances by repeating the steps above with different values of this parameter. The parameter to be tested will depend on the algorithm you chose.
418527

419-
- `svm()`: test each one of the 4 supported kernels
420-
- `knn()`: test the impact of the number of neighbors
421-
- `randomForest()`: test the impact of the `mtry`parameter
528+
- `svm()`: test each one of the 4 supported kernels
529+
- `knn()`: test the impact of the number of neighbors
530+
- `randomForest()`: test the impact of the `mtry`parameter
422531

423532
Tips: `tuneRF()` or `tune.randomForest()`, `tune.knn()`, `tune.svm()`
424533

@@ -450,4 +559,8 @@ Interpret the results: do the tested parameters affect a lot the performances (a
450559

451560
Comment the general assignment of the unclassified samples to the different classes. Analyse the relationship between the immuno markers and the predicted classes.
452561

562+
## A general conclusion and perspectives
563+
564+
(a few sentences, no more)
565+
453566

0 commit comments

Comments
 (0)