The human cortex is a great use case for this technology because the brain is divided into layers 1-6 and a white matter layer. A good analogy would be to think of them as an onion, where the layers are stacked right on top of each other spatially (x, y coordinates).
The layers are relatively distinct in their spatial locations, and each layer also has genes that are expressed highly in it but not in the others. This makes the cortex well suited to machine learning, because we can use both the spatial location and the gene expression of a cell to predict which layer it belongs to.
:::
::: column
_Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full)_

:::
**We will be using a synthetic cortical dataset with labelled cortical layers to train a random forest classifier to predict which layer a cell belongs to.**
### Cortical information
We have created a synthetic dataset (as the data has not been published yet) based on [this dataset](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full), in which the cortex is broken into six cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer. The dataset contains the spatial coordinates of each cell in the cortex, as well as the cortical layer that each cell belongs to.
```{python}
#| label: tbl-load_cortical_data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
```
So now we have a better idea of what the cross-section of the cortex looks like and where the different layers are located.
However, just using the x and y coordinates of each spot is not enough. You may have noticed that there appears to be some mixing of layers near the boundaries. Luckily for us, the different cortical layers have known genes that are highly expressed in distinct layers. We can take a quick look at some canonical markers that are used to identify the different cortical layers:
::: {#fig-cortical_marker_genes .figure}
{width=550}
Example of the spatial expression of known marker genes for each cortical layer.
_Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full)_
:::
In the dataset, you may have noticed that we also have the columns `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the log-normalized expression values for those genes in each cell. Similar to the figure above, we can visualize the expression of these marker genes in each cell (point) across the cortex to see the pattern of values across the different layers. This will give us a better idea of how we can use both the spatial location and gene expression to predict which layer a cell belongs to.
```{python}
#| label: fig-cortical_marker_genes
plt.tight_layout()
plt.show()
```
::: {.callout-note collapse="true"}
# Making multiple subplots in a loop
In the above code, we first initialized a plot with 2 rows and 3 columns, with the goal of plotting each of the 6 genes in our dataset. This creates an **array of plots** which we can then access to generate each subplot.

We could have proceeded using the `[]` indexing we have been using for matrices, but that is more complex, as we would need to keep track of both rows and columns in the for loop. Instead, we can use the `flatten()` method to convert this **2D array of plots into a flat list of plots**, which is easier to index in the for loop.
:::
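To make this pattern concrete, here is a tiny standalone sketch of the same loop. Only the gene names are taken from our dataset; the coordinates and expression values are randomly generated just for this illustration:

```{python}
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import matplotlib.pyplot as plt

genes = ["AQP4", "HPCAL1", "FREM3", "TRABD2A", "KRT17", "MOBP"]
rng = np.random.default_rng(0)

# A 2x3 grid of subplots: `axes` starts as a 2D (2, 3) array of Axes objects.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))

# flatten() turns the (2, 3) array into a flat array of 6 Axes,
# so a single loop index is enough.
axes = axes.flatten()

for i, gene in enumerate(genes):
    # Stand-in data: random coordinates coloured by a random "expression" value.
    axes[i].scatter(rng.random(50), rng.random(50), c=rng.random(50), s=10)
    axes[i].set_title(gene)

plt.tight_layout()
```

The same approach scales to any grid shape: flatten once, then loop over a single index.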
**We will be using this synthetic dataset to train a random forest classifier to predict the cortical layer labels based on the spatial location and gene expression of each cell.**
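As a rough sketch of what that workflow looks like, here is a minimal, self-contained example on a small made-up table. The column names mimic `df_cortical`, but the data is random, and the `_demo` variable names are ours (chosen so this sketch does not interfere with the tutorial's own variables); the actual columns and model parameters used below may differ:

```{python}
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_cortical: spatial coordinates plus one marker gene.
rng = np.random.default_rng(0)
n = 300
df_demo = pd.DataFrame({
    "x": rng.random(n),
    "y": rng.random(n),
    "AQP4": rng.random(n),
    "cortical_layer": rng.choice(["L1", "L2", "WM"], size=n),
})

# Features are every column except the label we want to predict.
X_demo = df_demo.drop(columns="cortical_layer")
y_demo = df_demo["cortical_layer"]

# Hold out 25% of cells so we can evaluate the model on unseen data.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit a forest of 100 decision trees on the training cells.
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_train_demo, y_train_demo)
```

The real tutorial trains on all six marker genes plus the coordinates, but the mechanics are identical.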
```{python}
y_pred = rf.predict(X_test)
```
So now we have `y_pred`, but what is this output?
```{python}
#| label: type_y_pred
type(y_pred)
```
```{python}
y_pred[0:5]
```
::: {.callout-note collapse="true"}

# Visualizing a _single_ decision tree in the random forest

```{python}
#| label: visualize_decision_tree
#| fig-cap: Visualization of a single decision tree from the random forest model.
#| fig-width: 40
#| fig-height: 15
# Source - https://stackoverflow.com/a/61037626
# Posted by Michael James Kali Galarnyk
# Retrieved 2026-04-09, License - CC BY-SA 4.0
from sklearn import tree

# Feature names are the columns the model was trained on, and class names are
# the cortical layer labels the model can predict.
fn = list(X_train.columns)
cn = list(rf.classes_)

fig, axes = plt.subplots(nrows = 1,
                         ncols = 1,
                         figsize = (70, 25),
                         dpi = 300)

tree.plot_tree(rf.estimators_[0],
               feature_names = fn,
               class_names = cn,
               filled = True)
```

:::
## Assessing model performance
At this point, we have the predicted labels for the test dataset, but how do we know if these predictions are accurate? To evaluate the performance of our model, we can compare the predicted labels to the true labels of the test dataset.
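As a tiny worked example of that comparison (the five labels below are invented for illustration; they are not our model's output):

```{python}
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted layer labels for five cells.
y_true = np.array(["L1", "L2", "L2", "WM", "L1"])
y_hat = np.array(["L1", "L2", "L1", "WM", "L1"])

# Accuracy: the fraction of cells whose predicted layer matches the true layer.
acc = accuracy_score(y_true, y_hat)  # 4 of 5 correct -> 0.8

# Confusion matrix: rows are true layers, columns are predicted layers,
# in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["L1", "L2", "WM"])
print(acc)
print(cm)
```

Here the off-diagonal `1` in the `L2` row tells us one true-`L2` cell was misclassified as `L1`; we will read our model's confusion matrix the same way.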