Commit e939ca6

Fix exercises
1 parent 938b957 commit e939ca6

1 file changed

Lines changed: 16 additions & 15 deletions

lessons/10_data_wrangling-Answer_key.qmd

@@ -45,7 +45,7 @@ important_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990",
 "ENSMUSG00000081010", "ENSMUSG00000030970"]
 ```

-1. Use the `in` operator to determine if all of these genes are present in the row names of the `rpkm_ordered` data frame.
+1. Use the `isin` operator to determine which row names of the `rpkm_ordered` DataFrame match our genes of interest.

 ```{python}
 #| label: exercise_1
@@ -61,14 +61,16 @@ rpkm_ordered.index.isin(important_genes)
 rpkm_ordered[rpkm_ordered.index.isin(important_genes)]
 ```

-3. **Bonus question:** Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]`, but without using the `isin` operator.
+3. **Bonus question:** Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]`, but without using the `isin` operator. Do you notice anything different about the output?

 ```{python}
 #| label: exercise_3
 # Extract rows for important genes without using isin operator
 rpkm_ordered.loc[important_genes]
 ```

+The rows are now in the order that we specified the genes of interest. When we used `isin` above, we got the rows in the order that they are present in the expression data.
+
 # Exercise 2

 ### Reading in and inspecting data
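
Note on the exercise 3 answer added in the hunk above: the ordering difference it describes can be reproduced on a small stand-in table. This is a minimal sketch under stated assumptions; `toy_rpkm`, its gene labels, and its values are made up for illustration and are not the course's `rpkm_ordered` data.

```python
import pandas as pd

# Hypothetical stand-in for rpkm_ordered (made-up labels and values)
toy_rpkm = pd.DataFrame(
    {"sample1": [1.0, 2.0, 3.0, 4.0]},
    index=["geneA", "geneB", "geneC", "geneD"],
)
# Genes of interest, deliberately listed in a different order than the table
genes_of_interest = ["geneD", "geneA"]

# Boolean mask from isin: rows come back in the table's own order
by_mask = toy_rpkm[toy_rpkm.index.isin(genes_of_interest)]

# Label lookup with .loc: rows come back in the order of the list
by_label = toy_rpkm.loc[genes_of_interest]

print(list(by_mask.index))   # ['geneA', 'geneD']
print(list(by_label.index))  # ['geneD', 'geneA']
```

One related difference in recent pandas versions: `.loc` with a list raises a `KeyError` if any requested label is missing, whereas the `isin` mask simply leaves out labels that are not present.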
@@ -84,7 +86,6 @@ animals = pd.read_csv("data/animals.csv")
 animals
 ```

-
 2. Check to make sure that `animals` is a dataframe.

 ```{python}
@@ -93,7 +94,6 @@ animals
 type(animals)
 ```

-
 3. How many rows are in the `animals` dataframe? How many columns?

 ```{python}
@@ -129,12 +129,12 @@ animals[animals["color"] == "Tan"]
 animals.loc[animals["color"] == "Tan"]
 ```

-6. Return the `speed` column for the rows with animals that are the `color` Tan.
+6. Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a DataFrame.

 ```{python}
 #| label: speed_column_tan
-# Return the speed column for animals with color Tan
-animals[animals["color"] == "Tan"]["speed"]
+# Return the color column for animals with speed > 50 as a DataFrame
+animals[animals["speed"] > 50]["color"].to_frame()
 ```

 7. Change the color of "Grey" to "Gray".
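
Note on the new exercise 6 answer in the hunk above: selecting a single column by name yields a Series, which is why the answer calls `.to_frame()`. The sketch below compares that with one possible alternative, a single `.loc` call that takes the column as a one-element list. The `toy_animals` table and its values are hypothetical and are not the contents of `data/animals.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the animals table (made-up values)
toy_animals = pd.DataFrame(
    {"speed": [40.0, 64.0, 56.0], "color": ["Grey", "Tan", "Tan"]},
    index=["Elephant", "Cheetah", "Tiger"],
)

# Boolean mask for animals faster than 50 km/h
fast = toy_animals["speed"] > 50

# Selecting one column by name gives a Series
as_series = toy_animals[fast]["color"]

# Approach used in the commit: convert the Series back to a DataFrame
via_to_frame = toy_animals[fast]["color"].to_frame()

# Alternative sketch: rows and columns in one .loc call;
# passing the column as a list keeps the result a DataFrame
via_loc = toy_animals.loc[fast, ["color"]]

print(type(as_series).__name__)     # Series
print(type(via_to_frame).__name__)  # DataFrame
print(type(via_loc).__name__)       # DataFrame
```

Both DataFrame results hold the same rows and the same single `color` column; the `.loc` form avoids chaining two indexing steps.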
@@ -156,7 +156,7 @@ animals_list = [animals["speed"], animals["color"]]
 animals_list
 ```

-9. Create a dictionary with the appropriate keys (i.e speed and color).
+9. Create a dictionary with the appropriate keys (i.e., speed and color).

 ```{python}
 #| label: create_dict
@@ -166,14 +166,13 @@ animals_dict = {"speed": animals["speed"],
 animals_dict
 ```

+### The `isin` operator, reordering and matching

-### The `in` operator, reordering and matching
-
-10. In the `data` directory, you should have a dataframe called `proj_summary` which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. **Copy and paste the code below to create a dataframe of control samples with the associated batch information**:
+10. In the `data` directory, you should have a table called `project-summary.txt` that contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. **Copy and paste the code below to create DataFrames for the project summary and for the control samples with the associated batch information**:

 ```{python}
 #| label: create_ctrl_samples_df
-# Read in proj_summary if needed
+# Read in proj_summary
 proj_summary = pd.read_table("data/project-summary.txt",
 header=0, index_col=0)

@@ -182,19 +181,20 @@ ctrl_samples = pd.DataFrame(
 data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018",
 "09/20/2018","03/15/2018"]},
 index = ["sample3", "sample10", "sample8",
-"sample4", "sample15"]
-)
+"sample4", "sample15"])
 ```

 ```{python}
 #| label: tbl-proj_summary
 #| tbl-cap: DataFrame of quality metric information for an RNA-seq dataset.
+# View RNA-seq QC information
 proj_summary
 ```

 ```{python}
 #| label: tbl-ctrl_samples
 #| tbl-cap: DataFrame of control samples with associated batch information.
+# View control batch information
 ctrl_samples
 ```

@@ -203,7 +203,8 @@ ctrl_samples
 ```{python}
 #| label: shared_ctrl_samples_proj_summary
 # Number of shared samples between ctrl_samples and proj_summary
-len(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)])
+print(len(proj_summary.index[proj_summary.index.isin(ctrl_samples.index)]))
+print(len(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)]))
 ```

 12. Keep only the rows in `proj_summary` which correspond to those in `ctrl_samples`. Do this with the `isin` operator. Save it to a variable called `proj_summary_ctrl`.
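
Note on the last hunk: the revised answer counts shared samples in both directions. The sketch below illustrates why the two counts agree when index labels are unique, and shows one way to list the labels that fail to match. The tables `toy_summary` and `toy_ctrl`, including the label `sample99`, are hypothetical stand-ins and not the course data.

```python
import pandas as pd

# Hypothetical stand-ins for proj_summary and ctrl_samples (made-up values)
toy_summary = pd.DataFrame(
    {"metric": [0.9, 0.8, 0.7]},
    index=["sample1", "sample3", "sample8"],
)
toy_ctrl = pd.DataFrame(
    {"date": ["01/13/2018", "01/13/2018", "09/20/2018"]},
    index=["sample3", "sample8", "sample99"],
)

# isin returns a boolean array; summing it counts the True entries
summary_in_ctrl = toy_summary.index.isin(toy_ctrl.index)
ctrl_in_summary = toy_ctrl.index.isin(toy_summary.index)

n_summary_in_ctrl = int(summary_in_ctrl.sum())
n_ctrl_in_summary = int(ctrl_in_summary.sum())

# Control samples with no matching row in the summary table
missing_from_summary = toy_ctrl.index[~ctrl_in_summary]

print(n_summary_in_ctrl, n_ctrl_in_summary)  # 2 2
print(list(missing_from_summary))            # ['sample99']
```

With unique labels on both sides, each count is the size of the intersection, so the two numbers match; duplicated labels in either index could make them differ, which is one reason checking both directions can be useful.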
