Commit af434ba

fix bug in tutorial
1 parent 818688b commit af434ba

2 files changed

Lines changed: 55 additions & 43 deletions

docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py

Lines changed: 26 additions & 18 deletions
@@ -5,31 +5,40 @@
 import os
 
 import synapseclient
-import synapseutils
 from synapseclient.models import Project
 
 syn = synapseclient.Synapse()
 syn.login()
 
 # Create some constants to store the paths to the data
 DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
-PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.tsv"))
+PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.csv"))
 
 # Step 1: Let's find the synapse ID of our project:
 my_project_id = syn.findEntityId(
     name="My uniquely named project about Alzheimer's Disease"
 )
 
-# Step 2: Create a manifest TSV file to upload data in bulk
-# Note: When this command is run it will re-create your directory structure within
-# Synapse. Be aware of this before running this command.
-# If folders with the exact names already exists in Synapse, those folders will be used.
-synapseutils.generate_sync_manifest(
-    syn=syn,
-    directory_path=DIRECTORY_FOR_MY_PROJECT,
-    parent_id=my_project_id,
-    manifest_path=PATH_TO_MANIFEST_FILE,
-)
+# Step 2: Create a manifest CSV file to upload data in bulk
+# Walk the local directory tree and build a manifest with the required "path" and
+# "parentId" columns. Folders that do not yet exist in Synapse are created
+# automatically by sync_to_synapse, so we set parentId to the project for every file.
+# NOTE: In a future release, Project.sync_from_synapse will support writing a manifest
+# CSV directly, removing the need to build one manually.
+import pandas as pd
+
+rows = []
+for dirpath, _dirnames, filenames in os.walk(DIRECTORY_FOR_MY_PROJECT):
+    for filename in filenames:
+        rows.append(
+            {
+                "path": os.path.join(dirpath, filename),
+                "parentId": my_project_id,
+            }
+        )
+
+df = pd.DataFrame(rows)
+df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
 # Step 3: After generating the manifest file, we can upload the data in bulk
 project = Project(id=my_project_id)
@@ -39,21 +48,20 @@
 # Pandas is a powerful data manipulation library in Python, although it is not required
 # for this tutorial, it is used here to demonstrate how you can manipulate the manifest
 # file before uploading it to Synapse.
-import pandas as pd
 
-# Read TSV file into a pandas DataFrame
-df = pd.read_csv(PATH_TO_MANIFEST_FILE, sep="\t")
+# Read CSV file into a pandas DataFrame
+df = pd.read_csv(PATH_TO_MANIFEST_FILE)
 
 # Add a new column to the DataFrame
 df["species"] = "Homo sapiens"
 
 # Write the DataFrame back to the manifest file
-df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
+df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
 project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE)
 
 # Step 5: Let's create an Activity/Provenance
-# First let's find the row in the TSV we want to update. This code finds the row number
+# First let's find the row in the CSV we want to update. This code finds the row number
 # that we would like to update.
 row_index = df[
     df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
@@ -77,6 +85,6 @@
 ] = "Experiment results created as a result of the linked data while running the pipeline."
 
 # Write the DataFrame back to the manifest file
-df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
+df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
 project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE)
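
Note on the new Step 2: the script's own comment says pandas is not required for the tutorial, so the manifest-building loop added in this commit can also be written with only the standard library. The following is a minimal sketch under that assumption, not part of the commit. The function names `build_manifest_rows` and `write_manifest` are hypothetical, and `parent_id` stands in for `my_project_id` from the script. It makes no Synapse calls and only produces the `path,parentId` CSV.

```python
import csv
import os


def build_manifest_rows(directory, parent_id):
    """Return one {"path", "parentId"} dict per file found under `directory`.

    Mirrors the loop in the diff: every file gets the same parentId.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in sorted(filenames):
            rows.append(
                {"path": os.path.join(dirpath, filename), "parentId": parent_id}
            )
    return rows


def write_manifest(rows, manifest_path):
    """Write the rows to a CSV file with a `path,parentId` header line."""
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "parentId"])
        writer.writeheader()
        writer.writerows(rows)
```

In the tutorial script this would be used as `write_manifest(build_manifest_rows(DIRECTORY_FOR_MY_PROJECT, my_project_id), PATH_TO_MANIFEST_FILE)`, after which the existing `project.sync_to_synapse(...)` call is unchanged.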

docs/tutorials/python/upload_data_in_bulk.md

Lines changed: 29 additions & 25 deletions
@@ -1,4 +1,5 @@
 # Uploading data in bulk
+
 This tutorial will follow a
 [Flattened Data Layout](../../explanations/structuring_your_project.md#flattened-data-layout-example).
 With a project that has this example layout:
@@ -19,10 +20,11 @@ With a project that has this example layout:
 ```
 
 ## Tutorial Purpose
+
 In this tutorial you will:
 
 1. Find the synapse ID of your project
-1. Create a manifest TSV file to upload data in bulk
+1. Create a manifest CSV file to upload data in bulk
 1. Upload all of the files for our project
 1. Add an annotation to all of our files
 1. Add a provenance/activity record to one of our files
@@ -40,56 +42,59 @@ In this tutorial you will:
 
 
 ## Prerequisites
+
 * Make sure that you have completed the following tutorials:
     * [Project](./project.md)
 * This tutorial is setup to upload the data from `~/my_ad_project`, make sure that this or
 another desired directory exists.
 * Pandas is used in this tutorial. Refer to our
 [installation guide](../installation.md#pypi) to install it. Feel free to skip this
 portion of the tutorial if you do not wish to use Pandas. You may also use external
-tools to open and manipulate Tab Separated Value (TSV) files.
+tools to open and manipulate CSV files.
 
 
 ## 1. Find the synapse ID of your project
 
 First let's set up some constants we'll use in this script, and find the ID of our project
 ```python
-{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=5-21}
+{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=5-22}
 ```
 
-## 2. Create a manifest TSV file to upload data in bulk
+## 2. Create a manifest CSV file to upload data in bulk
 
-Let's "walk" our directory on disk to create a manifest file for upload
+Let's walk our local directory and build a CSV manifest with the required `path` and
+`parentId` columns. In a future release `Project.sync_from_synapse` will support
+writing a manifest CSV directly; for now we build one with pandas.
 ```python
-{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=23-33}
+{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=23-44}
 ```
 
 <details class="example">
-<summary>After this has been run if you inspect the TSV file created you'll see it will look
+<summary>After this has been run if you inspect the CSV file created you'll see it will look
 similar to this:</summary>
 ```
-path parent
-/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R2.fastq.gz syn60109537
-/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R1.fastq.gz syn60109537
-/home/user_name/my_ad_project/biospecimen_experiment_2/fileD.txt syn60109543
-/home/user_name/my_ad_project/biospecimen_experiment_2/fileC.txt syn60109543
-/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R2.fastq.gz syn60109534
-/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz syn60109534
-/home/user_name/my_ad_project/biospecimen_experiment_1/fileA.txt syn60109540
-/home/user_name/my_ad_project/biospecimen_experiment_1/fileB.txt syn60109540
+path,parentId
+/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R2.fastq.gz,syn60109500
+/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R1.fastq.gz,syn60109500
+/home/user_name/my_ad_project/biospecimen_experiment_2/fileD.txt,syn60109500
+/home/user_name/my_ad_project/biospecimen_experiment_2/fileC.txt,syn60109500
+/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R2.fastq.gz,syn60109500
+/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz,syn60109500
+/home/user_name/my_ad_project/biospecimen_experiment_1/fileA.txt,syn60109500
+/home/user_name/my_ad_project/biospecimen_experiment_1/fileB.txt,syn60109500
 ```
 </details>
 
 ## 3. Upload the data in bulk
 ```python
-{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=35-37}
+{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=46-48}
 ```
 
 
 <details class="example">
 <summary>While this is running you'll see output in your console similar to:</summary>
 ```
-Validating manifest: /home/user_name/manifest-for-upload.tsv
+Validating manifest: /home/user_name/manifest-for-upload.csv
 Validating that all paths exist...
 Validating that all files are unique...
 Validating that all the files are not empty...
@@ -103,12 +108,12 @@ Uploading 8 files: 100%|██████████████████
 
 
 ## 4. Add an annotation to our manifest file
-At this point in the tutorial we will start to use pandas to manipulate a TSV file. If
+At this point in the tutorial we will use pandas to manipulate the CSV manifest. If
 you are not comfortable with pandas you may use any tool that can open and manipulate
-TSV such as excel or google sheets.
+CSV files such as Excel or Google Sheets.
 
 ```python
-{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=39-55}
+{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=50-63}
 ```
 
 Now that you have uploaded and annotated your files you'll be able to inspect your data
@@ -123,14 +128,14 @@ Let's create an [Activity/Provenance](../../explanations/domain_models_of_synaps
 record for one of our files. In otherwords, we will record the steps taken to generate
 the file.
 
-In this code we are finding a row in our TSV file and pointing to the file path of
+In this code we are finding a row in our CSV file and pointing to the file path of
 another file within our manifest. By doing this we are creating a relationship between
 the two files. This is a simple example of how you can create a provenance record in
 Synapse. Additionally we'll link off to a sample URL that describes a process that we
 may have executed to generate the file.
 
 ```python
-{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=57-83}
+{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=68-92}
 ```
 
 After running this code we may again inspect the synapse web UI. In this screenshot i've
@@ -157,7 +162,6 @@ navigated to the Files tab and selected the file that we added a Provenance reco
 
 - [syn.login][synapseclient.Synapse.login]
 - [syn.findEntityId][synapseclient.Synapse.findEntityId]
-- [synapseutils.generate_sync_manifest][]
 - [Project.sync_to_synapse][synapseclient.models.mixins.StorableContainer.sync_to_synapse]
-- [synapseutils.syncToSynapse][] *(deprecated)*
+- [Manifest CSV format](../../explanations/manifest_csv.md)
 - [Activity/Provenance](../../explanations/domain_models_of_synapse.md#activityprovenance)
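
Note on section 4 of the tutorial: the Prerequisites text says readers may skip pandas and edit the CSV manifest with other tools. As an illustration only, the annotation step (`df["species"] = "Homo sapiens"` in the script) could be approximated with Python's `csv` module as sketched below. The helper name `add_constant_column` is hypothetical and not part of the commit or of synapseclient, and the sketch only rewrites the local file; uploading still goes through `project.sync_to_synapse`.

```python
import csv


def add_constant_column(manifest_path, column, value):
    """Rewrite the CSV at `manifest_path` so every row has `column` set to `value`."""
    # Read the whole manifest first so the file can be safely overwritten afterwards.
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    # Append the column to the header only if it is not already present.
    if column not in fieldnames:
        fieldnames.append(column)
    for row in rows:
        row[column] = value

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `add_constant_column(PATH_TO_MANIFEST_FILE, "species", "Homo sapiens")` would leave the `path` and `parentId` columns untouched and add a `species` column, which is the shape of manifest the tutorial's pandas code writes before re-running the sync.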
