Commit db1dd30

Visualize a single decision tree
1 parent 760c95d commit db1dd30

1 file changed

Lines changed: 47 additions & 7 deletions

lessons/13_machine_learning.qmd

@@ -69,7 +69,9 @@ _Image source: [10x Genomics](https://www.10xgenomics.com/blog/your-introduction
 ::: columns

 ::: column
-The human cortex is a great use case for this technology because the brain is divvied into layers 1-6 and a white matter layer. A good analogy would be to think of them as onion layers, where the layers are stacked right on top of each other spatially (x, y coordinates). There are relatively clear boundaries between each layer.
+The human cortex is a great use case for this technology because it is divided into layers 1-6 and a white matter layer. A good analogy is an onion, where the layers are stacked directly on top of each other spatially (x, y coordinates).
+
+The layers are relatively distinct in their spatial locations, and each also has genes that are highly expressed in that layer but not in the others. This makes it a great use case for machine learning because we can use both the spatial location and gene expression to predict which layer a cell belongs to.
 :::

 ::: column
@@ -83,14 +85,12 @@ _Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026

 :::

-**We will be using this coritcal dataset with labelled cortical layers to train a random forest classifier to predict the cortical layer labels of cells.**
+**We will be using a synthetic cortical dataset with labelled cortical layers to train a random forest classifier to predict which layer a cell belongs to.**


 ### Cortical information

-We have created a made-up dataset (as the data has not been published yet) based upon [this dataset](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full). Where layers are broken into 6 cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer. Each of these layers has a unique spatial location.
-
-The dataset contains spatial coordinates of cells in the cortex, as well as the cortical layer that each cell belongs to.
+We have created a synthetic dataset (the real data has not been published yet) based on [this dataset](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full), in which the cortex is divided into 6 cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer. The dataset contains the spatial coordinates of cells in the cortex, as well as the cortical layer that each cell belongs to.

 ```{python}
 #| label: tbl-load_cortical_data
@@ -104,6 +104,7 @@ from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import accuracy_score, confusion_matrix

+# Load synthetic cortical dataset
 df_cortical = pd.read_csv("data/synthetic_cortex_data.csv")
 df_cortical.head()
 ```
@@ -135,8 +136,9 @@ plt.legend(title="Cortical Layer")
 plt.show()
 ```

-However, just using the x and y coordinates of each spot does not adequately represent the complexity of the data. We also have the expression of various genes for each spot as well. Lucky for us, the different cortical layers have known genes that are highly expressed in distinct layers. This is an additional pieces of information that we can use to predict
+We now have a better idea of what the cross-section of the cortex looks like and where the different layers are located.

+However, the x and y coordinates of each spot are not enough on their own. You may have noticed that there appears to be some mixing of layers near the boundaries. Luckily for us, the different cortical layers have known marker genes that are highly expressed in distinct layers. We can take a quick look at some canonical markers that are used to identify the different cortical layers:

 ::: {#fig-cortical_marker_genes .figure}
 ![](../img/paper_cortical_markers.png){width=550}
@@ -145,7 +147,7 @@ Example of the spatial expression of known marker genes for each cortical layer.
 _Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full)_
 :::

-In the dataframe, you have have noted that we also have columns: `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the log-normalized expression values for those genes in each cell. These genes are known to be highly expressed in specific cortical layers, so they can be used as markers to identify which layer a cell belongs to based on its gene expression profile. Once again we can visualize the expression of these marker genes across the cortex to see how they are distributed across the different layers:
+In the dataset, you may have noticed that we also have the columns `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the log-normalized expression values for those genes in each cell. As in the figure above, we can visualize the expression of these marker genes in each cell (point) across the cortex to see how the values vary across the different layers. This will give us a better idea of how we can use both the spatial location and gene expression to predict which layer a cell belongs to.

 ```{python}
 #| label: fig-cortical_marker_genes
@@ -176,6 +178,14 @@ plt.tight_layout()
 plt.show()
 ```

+::: {.callout-note collapse="true"}
+# Making multiple subplots in a loop
+In the above code, we first initialized a figure with 2 rows and 3 columns of subplots, with the goal of plotting each of the 6 genes in our dataset. This creates an **array of plots** which we can then index to generate each of our plots.
+
+We could have used the `[]` indexing we have been using for matrices, but that is more complex because we would need to keep track of both the row and the column in the for loop. Instead, we can use the `flatten()` method to convert this **2D array of plots into a 1D array of plots**, which is easier to index in the for loop.
+
+:::
+
 **We will be using this synthetic dataset to train a random forest classifier to predict the cortical layer labels based on the spatial location and gene expression of each cell.**


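The `flatten()` pattern described in the callout above can be sketched as follows. This is a minimal illustration rather than the lesson's actual plotting code: the coordinates and expression values are random stand-ins, the names `expression` and `flat_axes` are hypothetical, and the `Agg` backend is set only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Marker gene columns named in the lesson
genes = ["AQP4", "HPCAL1", "FREM3", "TRABD2A", "KRT17", "MOBP"]

# Made-up coordinates and expression values, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = rng.uniform(0, 1, size=50)
expression = {gene: rng.uniform(0, 5, size=50) for gene in genes}

# A 2 x 3 grid of subplots is returned as a 2D NumPy array of Axes objects
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 6))
print(axes.shape)

# flatten() gives a 1D array, so one loop index is enough
flat_axes = axes.flatten()
print(flat_axes.shape)

# Pair each subplot with one gene and draw it
for ax, gene in zip(flat_axes, genes):
    ax.scatter(x, y, c=expression[gene], s=10)
    ax.set_title(gene)

plt.tight_layout()
```

Because `flatten()` walks the grid row by row, `flat_axes[3]` is the same `Axes` object as `axes[1, 0]`, so the fourth gene lands in the first panel of the second row.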
@@ -277,6 +287,7 @@ y_pred = rf.predict(X_test)
 So now we have `y_pred`, but what is this output?

 ```{python}
+#| label: type_y_pred
 type(y_pred)
 ```

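As a hedged illustration of what `predict()` returns, the sketch below fits a small forest on made-up data and inspects the output. The names `X_toy`, `labels_toy`, `rf_toy`, and `y_pred_toy` are hypothetical and unrelated to the lesson's dataset; only the scikit-learn calls are the same ones the lesson uses.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: two well-separated "layers" along the y axis
rng = np.random.default_rng(0)
y_coord = np.concatenate([rng.uniform(0.0, 0.4, 20), rng.uniform(0.6, 1.0, 20)])
X_toy = pd.DataFrame({"x": rng.uniform(0, 1, 40), "y": y_coord})
labels_toy = np.array(["L1"] * 20 + ["WM"] * 20)

# Fit a small forest and predict on the same rows (illustration only)
rf_toy = RandomForestClassifier(n_estimators=10, random_state=0)
rf_toy.fit(X_toy, labels_toy)
y_pred_toy = rf_toy.predict(X_toy)

# predict() returns a NumPy array with one label per input row
print(type(y_pred_toy))
print(y_pred_toy.shape)
print(y_pred_toy[0:5])
```

Since the result is a NumPy array, it can be sliced with `[0:5]` like any other array and compared element by element against the true labels.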
@@ -289,6 +300,35 @@ y_pred[0:5]
 ```


+::: {.callout-note collapse="true"}
+# Visualizing a _single_ decision tree in the random forest
+
+```{python}
+#| label: visualize_decision_tree
+#| fig-cap: Visualization of a single decision tree from the random forest model.
+#| fig-width: 40
+#| fig-height: 15
+# Source - https://stackoverflow.com/a/61037626
+# Posted by Michael James Kali Galarnyk
+# Retrieved 2026-04-09, License - CC BY-SA 4.0
+from sklearn import tree
+
+fn = list(X_test.columns)  # feature (column) names; assumes X_test is a DataFrame
+cn = [str(c) for c in rf.classes_]  # class labels in the order the forest encodes them
+
+fig, axes = plt.subplots(nrows = 1,
+                         ncols = 1,
+                         figsize = (70, 25),
+                         dpi=300)
+
+tree.plot_tree(rf.estimators_[0],
+               feature_names = fn,
+               class_names = cn,
+               filled = True)
+```
+
+:::
+
 ## Assessing model performance

 At this point, we have the predicted labels for the test dataset, but how do we know if these predictions are accurate? To evaluate the performance of our model, we can compare the predicted labels to the true labels of the test dataset.
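The comparison described above can be sketched with the two metrics the lesson imports, `accuracy_score` and `confusion_matrix`. The labels below are made up for illustration, and the names `y_test_demo`, `y_pred_demo`, and `layer_order` are hypothetical; in the lesson the inputs would be the true test labels and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted layer labels, for illustration only
y_test_demo = np.array(["L1", "L1", "L2", "L2", "L3", "L3", "WM", "WM"])
y_pred_demo = np.array(["L1", "L1", "L2", "L3", "L3", "L3", "WM", "L1"])

# Fraction of cells whose predicted layer matches the true layer
acc = accuracy_score(y_test_demo, y_pred_demo)
print(acc)  # 0.75 (6 of 8 labels match)

# Rows are true layers, columns are predicted layers, in the order given
layer_order = ["L1", "L2", "L3", "WM"]
cm = confusion_matrix(y_test_demo, y_pred_demo, labels=layer_order)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, while off-diagonal entries show which layers are confused with each other, for example the one `L2` cell predicted as `L3` here.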
