
Commit 18ba6e1 (parent 516f55f)

Commit message: update

3 files changed: 23 additions & 9 deletions


03-Classification.qmd

Lines changed: 12 additions & 9 deletions
@@ -67,7 +67,7 @@ nhanes_train_binned = nhanes_train_binned.dropna()
 plt.clf()
 plt.scatter(nhanes_train_binned['bin_midpoint'], nhanes_train_binned['p'], color='blue')
 plt.xlabel('BMI - Binned Midpoint')
-plt.ylabel('Empirical Hypertension Probability')
+plt.ylabel('Proportion of people with Hypertension')
 plt.ylim(0, 1)
 plt.grid(True)
 plt.show()
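The binned proportions plotted in the hunk above can be computed in a few lines. This is only a sketch on synthetic data with assumed column names (`BMI`, `Hypertension`), not the lesson's actual NHANES preprocessing: cut BMI into bins, then take the fraction of people with Hypertension in each bin.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the NHANES training data (assumed columns).
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 55, size=500)
hypertension = (rng.uniform(size=500) < 1 / (1 + np.exp(-(bmi - 45) / 4))).astype(int)
df = pd.DataFrame({"BMI": bmi, "Hypertension": hypertension})

# Cut BMI into 10 bins, then summarize each bin.
bins = pd.cut(df["BMI"], bins=10)
binned = df.groupby(bins, observed=True).agg(
    bin_midpoint=("BMI", "mean"),   # a representative BMI for the bin
    p=("Hypertension", "mean"),     # proportion with Hypertension in the bin
).dropna()
print(binned)
```

Each `p` is an empirical proportion, so it necessarily lands in [0, 1], which is what makes this plot a useful sanity check for a probability model.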
@@ -88,7 +88,7 @@ $$P(Hypertension)=\beta_0+\beta_1 \cdot BMI$$
 
 *does not* give us outputs between 0 and 1. To deal with this, we perform the **Logistic Transformation:**
 
-$$P(Hyptertension) = \frac{1}{1+e^{-(\beta_0 + \beta_1 \cdot BMI)}}$$
+$$P(Hypertension) = \frac{1}{1+e^{-(\beta_0 + \beta_1 \cdot BMI)}}$$
 
 This forces the right-hand side of the equation to be between 0 and 1, which is on the scale of a probability. The relationship between the X and Y axes is not a straight line, but a non-linear, "S-shaped" one. Let's fit this model and look at it visually to understand.
 
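A quick numeric check of the logistic transformation in the hunk above. The coefficients here are made up for illustration, not fitted values from the lesson:

```python
import numpy as np

def p_hypertension(bmi, beta0=-6.0, beta1=0.12):
    # Logistic transformation: squashes beta0 + beta1*BMI into (0, 1).
    return 1 / (1 + np.exp(-(beta0 + beta1 * bmi)))

probs = p_hypertension(np.array([15.0, 30.0, 50.0]))
print(probs)  # strictly between 0 and 1, and increasing in BMI
```

However large or small the linear predictor gets, the output never leaves (0, 1), which is exactly the property the raw linear model lacked.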

@@ -100,7 +100,7 @@ plt.clf()
 plt.scatter(X_train.BMI, logit_model.predict_proba(X_train)[:, 1], color="red", label="Fitted Line")
 plt.scatter(nhanes_train_binned['bin_midpoint'], nhanes_train_binned['p'], color='blue')
 plt.xlabel('BMI')
-plt.ylabel('Probability of Hypertension')
+plt.ylabel('Proportion of people with Hypertension')
 plt.ylim(0, 1)
 plt.legend()
 plt.show()
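The `logit_model.predict_proba` call in the hunk above comes from a fitted scikit-learn model. A self-contained sketch of that setup, on synthetic data rather than the lesson's NHANES sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: BMI as the only feature, with an S-shaped
# true risk so the fit has something to recover.
rng = np.random.default_rng(0)
X_train = rng.uniform(18, 55, size=(500, 1))
p_true = 1 / (1 + np.exp(-(X_train[:, 0] - 45) / 4))
y_train = (rng.uniform(size=500) < p_true).astype(int)

logit_model = LogisticRegression().fit(X_train, y_train)
proba = logit_model.predict_proba(X_train)[:, 1]  # column 1 = P(class 1)
print(proba.min(), proba.max())                   # stays inside (0, 1)
```

`predict_proba` returns one column per class; taking `[:, 1]` selects the probability of the positive class, which is what gets plotted against BMI as the fitted curve.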
@@ -146,7 +146,15 @@ Okay, that's a starting point!
 
 However, we need to be mindful of the class imbalance we saw in the dataset at the beginning of the lesson. Recall that roughly 88% of our data is No Hypertension. If we had a classifier that *always* predicted No Hypertension, we would achieve 88% accuracy, so it is questionable whether our model's 76% accuracy is actually impressive.
 
-We can break down classification accuracy to four additional results, via a table called the **Confusion Matrix**:
+Let's break down the accuracy by the Hypertension events and the No Hypertension events:
+
+Our **Sensitivity** (accuracy on Hypertension events) is defined as $\frac{TruePositives}{TruePositives+FalseNegatives}$, which is 15/(15+325) ≈ 4%.
+
+Our **Specificity** (accuracy on No Hypertension events) is defined as $\frac{TrueNegatives}{TrueNegatives+FalsePositives}$, which is 1128/(1128+24) ≈ 98%.
+
+Therefore, we do a pretty terrible job of predicting the Hypertension cases!
+
+We can see the detailed counts in a table called the **Confusion Matrix**:
 
 ```{python}
 cm = confusion_matrix(y_test, logit_model.predict(X_test))
@@ -157,11 +165,6 @@ plt.show()
 
 The top left hand corner is the number of True Negatives (1128), the top right hand corner is the number of False Positives (24), the bottom left corner is the number of False Negatives (325), and the bottom right corner is the number of True Positives (15).
 
-Our **Sensitivity** (accuracy of Hypertension events) is defined as: $\frac{TP}{TP+FN}$, which is 15/(15+325) = 4%
-
-Our **Specificity** (accuracy of No Hypertension events) is defined as: $\frac{TN}{TN+FP}$, which is 1128/(1128+24) = 98%.
-
-Therefore, we do a pretty terrible job of predicting the Hypertension cases!
 
 What happened exactly? Looking back at the Training Data, the plots suggest we only predict Hypertension for a BMI of about 50 or more. However, so few people have such a high BMI that the model misses most of the people with Hypertension in the 20-40 BMI range: their predicted probabilities never reach our decision boundary of 50%, so most of our Hypertension cases are missed.
 
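The arithmetic in the lesson text above can be checked directly from the four counts in the matrix:

```python
# Confusion-matrix counts quoted in the lesson above.
tn, fp, fn, tp = 1128, 24, 325, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
sensitivity = tp / (tp + fn)                # accuracy on Hypertension events
specificity = tn / (tn + fp)                # accuracy on No Hypertension events
print(f"{accuracy:.0%} {sensitivity:.0%} {specificity:.0%}")
```

This makes the imbalance story concrete: the overall accuracy is dominated by the plentiful No Hypertension cases, while the rare Hypertension cases are almost all missed.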

(binary file changed: 5.19 MB, not shown)

classroom_data_preprocess.py

Lines changed: 11 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 drug = pd.read_csv("classroom_data/Drug_sensitivity_AUC_(Sanger_GDSC1)_subsetted.csv")
 expression = pd.read_csv("classroom_data/Expression_Public_25Q3_subsetted.csv")
+mutation = pd.read_csv("classroom_data/OmicsSomaticMutationsMatrixHotspot.csv")
 
 gefitinib = drug.loc[:, ["Unnamed: 0", "GEFITINIB (GDSC1:1010)"]]
 gefitinib = gefitinib.rename(columns={'GEFITINIB (GDSC1:1010)': 'GEFITINIB'})
@@ -14,7 +15,17 @@
 with open("classroom_data/GEFITINIB_Expression.pickle", 'wb') as file:
     pickle.dump(gefitinib_expression, file)
 
+mutation = mutation.drop(columns=["Unnamed: 0", 'SequencingID', 'ModelConditionID', 'IsDefaultEntryForModel', 'IsDefaultEntryForMC'])
+gefitinib_mutation = gefitinib.merge(mutation, left_on = "Unnamed: 0", right_on = "ModelID")
+gefitinib_mutation = gefitinib_mutation.dropna(subset=["GEFITINIB"])
+gefitinib_mutation = gefitinib_mutation.drop(columns=["Unnamed: 0", "ModelID"])
+with open("classroom_data/GEFITINIB_mutation.pickle", 'wb') as file:
+    pickle.dump(gefitinib_mutation, file)
 
+
+with open("classroom_data/GEFITINIB_Expression.pickle", 'wb') as file:
+    pickle.dump(gefitinib_expression, file)
+
 docetaxel = drug.loc[:, ["Unnamed: 0", "DOCETAXEL (GDSC1:1007)"]]
 docetaxel = docetaxel.rename(columns={'DOCETAXEL (GDSC1:1007)': 'DOCETAXEL'})
 docetaxel_expression = docetaxel.merge(expression)
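The added lines above merge tables whose keys live in differently named columns, which is why `left_on`/`right_on` are passed explicitly. A toy version of that pattern, with hypothetical IDs and a hypothetical `EGFR` column standing in for the real mutation matrix:

```python
import pandas as pd

# Miniature stand-ins for the drug-sensitivity and mutation tables.
gefitinib = pd.DataFrame({"Unnamed: 0": ["ACH-000001", "ACH-000002"],
                          "GEFITINIB": [0.71, None]})
mutation = pd.DataFrame({"ModelID": ["ACH-000001", "ACH-000002", "ACH-000003"],
                         "EGFR": [1, 0, 1]})

# Inner join on the differently named key columns; both keys survive the merge.
merged = gefitinib.merge(mutation, left_on="Unnamed: 0", right_on="ModelID")
merged = merged.dropna(subset=["GEFITINIB"])             # drop cell lines with no AUC
merged = merged.drop(columns=["Unnamed: 0", "ModelID"])  # keep only data columns
print(merged)
```

The default `how="inner"` keeps only cell lines present in both tables, so models without mutation calls (like the third ID here) silently drop out; that is worth knowing when row counts shrink after a merge like this.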
