Commit 4e2ed0b

added new sync_to_synapse method
1 parent fa8f6fa commit 4e2ed0b

13 files changed

Lines changed: 2946 additions & 36 deletions

File tree

docs/explanations/manifest_csv.md

Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
# Manifest CSV
2+
3+
The manifest is a CSV file with file locations and metadata used to bulk upload and download files in Synapse. It is the standard manifest format used by `Project.sync_from_synapse`, `Project.sync_to_synapse`, `Folder.sync_from_synapse`, `Folder.sync_to_synapse`, the Synapse UI download cart, and the `synapse get-download-list` CLI command.
4+
5+
!!! note "Replaces legacy TSV manifest"
6+
This CSV manifest replaces the legacy TSV manifest (`SYNAPSE_METADATA_MANIFEST.tsv`) produced by [synapseutils.syncFromSynapse][]. The `syncFromSynapse` and `syncToSynapse` utility functions are deprecated and will be removed in v5.0.0. Use `Project.sync_from_synapse` / `Folder.sync_from_synapse` and `Project.sync_to_synapse` / `Folder.sync_to_synapse` instead. See the [legacy TSV manifest documentation](manifest_tsv.md) for details on the old format.
7+
8+
## Manifest file format
9+
10+
The manifest is a comma-separated values (CSV) file with one row per file and one column per attribute of that file. The minimum required columns for uploading are **path** and **parentId**, where `path` is the local file path and `parentId` is the Synapse ID of the project or folder that the file is uploaded to.
11+
12+
Values that contain commas are quoted automatically when the manifest is written (e.g., `"hello, world"`), following standard CSV `QUOTE_MINIMAL` quoting.
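
To make the quoting rule concrete, here is a minimal sketch that uses Python's standard `csv` module, not the Synapse client itself. The `note` column and the row values are hypothetical; only `path` and `parentId` come from the format described above.

```python
import csv
import io

# Hypothetical rows: `note` is an illustrative annotation column.
rows = [
    {"path": "/path/file1.txt", "parentId": "syn1243", "note": "hello, world"},
    {"path": "/path/file2.txt", "parentId": "syn1243", "note": "plain value"},
]


def write_manifest(rows, handle):
    """Write rows as CSV, quoting only the values that need it."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["path", "parentId", "note"],
        quoting=csv.QUOTE_MINIMAL,  # quote a value only if it contains a special character
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

Only the value containing a comma is wrapped in quotes. Reading the text back with `csv.DictReader` returns the original, unquoted strings.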
13+
14+
### Required fields for upload
15+
16+
| Field | Meaning | Example |
17+
|----------|----------------------------|-------------------------|
18+
| path | local file path or URL | /path/to/local/file.txt |
19+
| parentId | Synapse ID of parent | syn1235 |
20+
21+
### Standard fields
22+
23+
These columns are recognized by `sync_to_synapse` and have specific meaning. Any of these columns may be present in the manifest but only `path` and `parentId` are required for upload.
24+
25+
| Field | Meaning | Example |
26+
|---------------------|--------------------------------------------|----------------------------------------------|
27+
| path | local file path or URL | /path/to/local/file.txt |
28+
| parentId | Synapse ID of parent container | syn1235 |
29+
| ID | Synapse entity ID | syn2345 |
30+
| name | name of file in Synapse | Example_file |
31+
| synapseStore | whether to upload the file | True |
32+
| contentType | content type of file, overriding defaults | text/html |
33+
| forceVersion | whether to update version | False |
34+
| activityName | name of activity in provenance | Ran normalization |
35+
| activityDescription | text description of what was done | Ran algorithm xyz with parameters... |
36+
| used | list of items used to generate file | syn1235;/path/to_local/file.txt |
37+
| executed | list of items executed | https://github.org/;/path/to_local/code.py |
38+
39+
### Metadata fields (ignored during upload)
40+
41+
These columns are present in manifests generated by the Synapse UI download cart and `synapse get-download-list` CLI. They are ignored by `sync_to_synapse` and are **not** treated as annotations.
42+
43+
| Field | Meaning |
44+
|-------------------|-------------------------------|
45+
| error | any error in downloading file |
46+
| versionNumber | version of the file |
47+
| dataFileSizeBytes | size of the file in bytes |
48+
| createdBy | user who created the file |
49+
| createdOn | date the file was created |
50+
| modifiedBy | user who last modified |
51+
| modifiedOn | date last modified |
52+
| synapseURL | URL to the file in Synapse |
53+
| dataFileMD5Hex | MD5 hash of the file |
54+
55+
### Annotations
56+
57+
Any columns that are not in the standard or metadata fields described above will be interpreted as annotations of the file.
58+
59+
Adding annotations to each row:
60+
61+
| path | parentId | annot1 | annot2 | annot3 | annot4 | annot5 |
62+
| --- | --- | --- | --- | --- | --- | --- |
63+
| /path/file1.txt | syn1243 | bar | 3.1415 | "aaaa, bbbb" | "[14,27,30]" | "Annotation, with a comma" |
64+
| /path/file2.txt | syn12433 | baz | 2.71 | value_1 | "[1,2,3]" | test 123 |
65+
| /path/file3.txt | syn12455 | zzz | 3.52 | value_3 | "[42,56,77]" | a single annotation |
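
The rule above amounts to a set difference over the header row. The sketch below illustrates that idea only and is not the client's actual implementation. The function name `annotation_columns` is hypothetical, and treating column names as case-sensitive is an assumption.

```python
# Field names taken from the "Standard fields" table.
STANDARD_FIELDS = {
    "path", "parentId", "ID", "name", "synapseStore", "contentType",
    "forceVersion", "activityName", "activityDescription", "used", "executed",
}

# Field names taken from the "Metadata fields" table.
# The file-size column is spelled `dataFileSizeBytes` here (assumed spelling).
METADATA_FIELDS = {
    "error", "versionNumber", "dataFileSizeBytes", "createdBy", "createdOn",
    "modifiedBy", "modifiedOn", "synapseURL", "dataFileMD5Hex",
}


def annotation_columns(header):
    """Return the columns that would be treated as annotations, in header order."""
    reserved = STANDARD_FIELDS | METADATA_FIELDS
    return [column for column in header if column not in reserved]


header = ["path", "parentId", "annot1", "createdOn", "annot2"]
print(annotation_columns(header))
```

Checking a header this way before uploading is a cheap way to catch a misspelled standard column, which would otherwise be stored as an annotation.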
66+
67+
#### Multiple values of annotations per key
68+
69+
Use multiple values for a single annotation sparingly, as they make the data more
70+
difficult to manage. However, they are supported.
71+
72+
**Annotations can be comma `,` separated lists surrounded by brackets `[]`.**
73+
74+
Because the manifest is a CSV file, multi-value annotations that contain commas are automatically quoted. For example, `[a,b,c]` will appear in the CSV as `"[a,b,c]"`.
75+
76+
This is an annotation with 3 values:
77+
78+
| path | parentId | annot1 |
79+
|-----------------|----------|--------------|
80+
| /path/file1.txt | syn1243 | "[a,b,c]" |
81+
82+
This is an annotation with 1 value (no brackets):
83+
84+
| path | parentId | annot1 |
85+
|-----------------|----------|----------------------------|
86+
| /path/file1.txt | syn1243 | "my, sentence, with, commas" |
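
The difference between a bracketed list and a single value can be sketched with the standard `csv` module. The helper `parse_annotation` is hypothetical and uses a plain comma split. How the client itself handles items that contain commas or quotes inside the brackets is not covered here.

```python
import csv
import io


def parse_annotation(cell):
    """Split a bracketed cell such as "[a,b,c]" into a list; keep other cells as one value."""
    text = cell.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1]
        # Assumption: items are separated by plain commas and contain none themselves.
        return [item.strip() for item in inner.split(",") if item.strip()]
    return [cell]


csv_text = (
    "path,parentId,annot1\n"
    '/path/file1.txt,syn1243,"[a,b,c]"\n'
    '/path/file2.txt,syn1243,"one, value"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
parsed = [parse_annotation(row["annot1"]) for row in rows]
print(parsed)
```

The CSV reader removes the surrounding quotes first, so the bracket check sees `[a,b,c]` in the first row and the unbracketed sentence in the second.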
87+
88+
### Dates in the manifest file
89+
90+
Dates within the manifest file will always be written in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format in UTC without milliseconds. For example: `2023-12-20T16:55:08Z`.
91+
92+
Dates can be written in other formats specified in ISO 8601 and they will be recognized. However, `sync_from_synapse` will always write dates in the UTC format specified above. For example, you may want to specify a datetime at a specific timezone like `2023-12-20 23:55:08-07:00` and this will be recognized as a valid datetime.
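
To normalize dates yourself before editing a manifest, the conversion can be sketched with the standard library. The helper name `to_manifest_utc` is hypothetical, and treating a datetime with no offset as UTC is an assumption rather than documented client behavior.

```python
from datetime import datetime, timezone


def to_manifest_utc(value):
    """Convert an ISO 8601 datetime string to UTC, formatted without fractional seconds."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a datetime with no offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_manifest_utc("2023-12-20 23:55:08-07:00"))
```

The `-07:00` offset moves the timestamp to the next calendar day in UTC, which is a common source of confusion when comparing values written by hand with values written back by `sync_from_synapse`.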
93+
94+
## Manifest sources
95+
96+
The CSV manifest format is shared across multiple tools:
97+
98+
| Source | Filename | Format |
99+
|----------------------------------------------------------|---------------------------------|--------|
100+
| `Project.sync_from_synapse` / `Folder.sync_from_synapse` | manifest.csv | CSV |
101+
| Synapse UI download cart | manifest.csv | CSV |
102+
| CLI `synapse get-download-list` | manifest\_\<timestamp\>.csv | CSV |
103+
104+
A manifest generated by any of these sources can be used as input to `sync_to_synapse`, provided the `path` column is present with valid local file paths. Manifests from the Synapse UI and CLI do not include a `path` column by default, so users must add it before uploading.
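
Adding the missing `path` column can be done with any CSV tool. Below is a minimal sketch using the standard `csv` module. It assumes the manifest has a `name` column and that every file sits in a single local directory under its original name. Both are assumptions about your setup, and `add_path_column` is a hypothetical helper.

```python
import csv
import io
import os


def add_path_column(manifest_text, download_dir):
    """Return manifest text with a leading `path` column built from each row's `name`."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    fieldnames = ["path"] + [name for name in reader.fieldnames if name != "path"]

    output = io.StringIO()
    writer = csv.DictWriter(
        output,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Assumption: the local file keeps the name recorded in the manifest.
        row["path"] = os.path.join(download_dir, row["name"])
        writer.writerow(row)
    return output.getvalue()


downloaded = "ID,name,parentId\nsyn5001,file1.txt,syn1243\n"
print(add_path_column(downloaded, "/data"))
```

Existing columns are preserved in their original order, so metadata columns from the download cart pass through unchanged and are ignored on upload.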
105+
106+
### Example manifest file
107+
108+
| path | parentId | ID | name | annot1 | annot2 | collection_date | used | executed |
109+
|-----------------|----------|---------|-----------|--------|--------|---------------------------|--------------------------|------------------------------|
110+
| /path/file1.txt | syn1243 | syn5001 | file1.txt | bar | 3.1415 | 2023-12-04T07:00:00Z | syn124;/path/file2.txt | https://github.org/foo/bar |
111+
| /path/file2.txt | syn12433 | syn5002 | file2.txt | baz | 2.71 | 2001-01-01T08:00:00Z | | https://github.org/foo/baz |
112+
| /path/file3.txt | syn12455 | syn5003 | file3.txt | zzz | 3.52 | 2023-12-04T07:00:00Z | | https://github.org/foo/zzz |
113+
114+
## References
115+
116+
- [Project.sync_from_synapse][synapseclient.models.Project.sync_from_synapse]
117+
- [Project.sync_to_synapse][synapseclient.models.Project.sync_to_synapse]
118+
- [Folder.sync_from_synapse][synapseclient.models.Folder.sync_from_synapse]
119+
- [Folder.sync_to_synapse][synapseclient.models.Folder.sync_to_synapse]
120+
- [Manifest TSV (legacy)](manifest_tsv.md)
121+
- [Managing custom metadata at scale](https://help.synapse.org/docs/Managing-Custom-Metadata-at-Scale.2004254976.html#ManagingCustomMetadataatScale-BatchUploadFileswithAnnotations)

docs/explanations/manifest_tsv.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
1-
# Manifest
1+
# Manifest TSV (Legacy)
22
The manifest is a tsv file with file locations and metadata to be pushed to Synapse. The purpose is to allow bulk actions through a TSV without the need to manually execute commands for every requested action.
33

4+
!!! warning "Deprecated"
5+
This TSV manifest format is produced by [synapseutils.syncFromSynapse][] and consumed by [synapseutils.syncToSynapse][], both of which are deprecated and will be removed in v5.0.0. Use `Project.sync_from_synapse` / `Folder.sync_from_synapse` and `Project.sync_to_synapse` / `Folder.sync_to_synapse` instead, which use the [CSV manifest format](manifest_csv.md).
6+
47
## Manifest file format
58

69
The format of the manifest file is a tab delimited file with one row per file to upload and columns describing the file. The minimum required columns are **path** and **parent** where path is the local file path and parent is the Synapse Id of the project or folder where the file is uploaded to.

docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py

Lines changed: 5 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66

77
import synapseclient
88
import synapseutils
9+
from synapseclient.models import Project
910

1011
syn = synapseclient.Synapse()
1112
syn.login()
@@ -31,9 +32,8 @@
3132
)
3233

3334
# Step 3: After generating the manifest file, we can upload the data in bulk
34-
synapseutils.syncToSynapse(
35-
syn=syn, manifestFile=PATH_TO_MANIFEST_FILE, sendMessages=False
36-
)
35+
project = Project(id=my_project_id)
36+
project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE)
3737

3838
# Step 4: Let's add an annotation to our manifest file
3939
# Pandas is a powerful data manipulation library in Python, although it is not required
@@ -50,11 +50,7 @@
5050
# Write the DataFrame back to the manifest file
5151
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
5252

53-
synapseutils.syncToSynapse(
54-
syn=syn,
55-
manifestFile=PATH_TO_MANIFEST_FILE,
56-
sendMessages=False,
57-
)
53+
project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE)
5854

5955
# Step 5: Let's create an Activity/Provenance
6056
# First let's find the row in the TSV we want to update. This code finds the row number
@@ -83,8 +79,4 @@
8379
# Write the DataFrame back to the manifest file
8480
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
8581

86-
synapseutils.syncToSynapse(
87-
syn=syn,
88-
manifestFile=PATH_TO_MANIFEST_FILE,
89-
sendMessages=False,
90-
)
82+
project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE)

docs/tutorials/python/upload_data_in_bulk.md

Lines changed: 20 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,14 @@ In this tutorial you will:
2727
1. Add an annotation to all of our files
2828
1. Add a provenance/activity record to one of our files
2929

30+
!!! tip "Preferred API"
31+
The recommended way to upload files in bulk is
32+
[`Project.sync_to_synapse`][synapseclient.models.mixins.StorableContainer.sync_to_synapse]
33+
(or `Folder.sync_to_synapse`).
34+
The legacy `synapseutils.syncToSynapse` is deprecated and will be removed in v5.0.0.
35+
3036
!!! warning "Uploading Very Large Files"
31-
The bulk upload approach using `synapseutils.syncToSynapse()` is optimized for uploading many files efficiently. However, if you are uploading very large files (>100 GiB each), consider using **sequential uploads with async API** instead.
37+
The bulk upload approach using `Project.sync_to_synapse()` is optimized for uploading many files efficiently. However, if you are uploading very large files (>100 GiB each), consider using **sequential uploads with async API** instead.
3238

3339
For very large file uploads, see the `execute_walk_file_sequential()` function in [uploadBenchmark.py](https://github.com/Sage-Bionetworks/synapsePythonClient/blob/develop/docs/scripts/uploadBenchmark.py#L286) as a reference implementation. This approach uses `asyncio.run(file.store_async())` with the newer async API, which has been optimized for handling very large files efficiently. In benchmarks, this pattern successfully uploaded 45 files of 100 GB each (4.5 TB total) in approximately 20.6 hours.
3440

@@ -48,14 +54,14 @@ tools to open and manipulate Tab Separated Value (TSV) files.
4854

4955
First let's set up some constants we'll use in this script, and find the ID of our project
5056
```python
51-
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=5-20}
57+
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=5-21}
5258
```
5359

5460
## 2. Create a manifest TSV file to upload data in bulk
5561

5662
Let's "walk" our directory on disk to create a manifest file for upload
5763
```python
58-
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=21-31}
64+
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=23-33}
5965
```
6066

6167
<details class="example">
@@ -76,23 +82,20 @@ path parent
7682

7783
## 3. Upload the data in bulk
7884
```python
79-
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=32-36}
85+
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=35-37}
8086
```
8187

8288

8389
<details class="example">
8490
<summary>While this is running you'll see output in your console similar to:</summary>
8591
```
86-
Validation and upload of: /home/user_name/manifest-for-upload.tsv
87-
Validating columns of manifest.....OK
88-
Validating that all paths exist...........OK
89-
Validating that all files are unique...OK
90-
Validating that all the files are not empty...OK
92+
Validating manifest: /home/user_name/manifest-for-upload.tsv
93+
Validating that all paths exist...
94+
Validating that all files are unique...
95+
Validating that all the files are not empty...
9196
Validating file names...
92-
OK
93-
Validating provenance...OK
94-
Validating that parents exist and are containers...OK
95-
We are about to upload 8 files with a total size of 8.
97+
Validating provenance and parent containers...
98+
About to upload 8 files with a total size of 8 bytes.
9699
Uploading 8 files: 100%|███████████████████| 8.00/8.00 [00:01<00:00, 6.09B/s]
97100
```
98101
</details>
@@ -105,7 +108,7 @@ you are not comfortable with pandas you may use any tool that can open and manip
105108
TSV such as Excel or Google Sheets.
106109

107110
```python
108-
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=37-57}
111+
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=39-55}
109112
```
110113

111114
Now that you have uploaded and annotated your files you'll be able to inspect your data
@@ -127,7 +130,7 @@ Synapse. Additionally we'll link off to a sample URL that describes a process th
127130
may have executed to generate the file.
128131

129132
```python
130-
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=58-90}
133+
{!docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py!lines=57-83}
131134
```
132135

133136
After running this code we may again inspect the Synapse web UI. In this screenshot I've
@@ -155,5 +158,6 @@ navigated to the Files tab and selected the file that we added a Provenance reco
155158
- [syn.login][synapseclient.Synapse.login]
156159
- [syn.findEntityId][synapseclient.Synapse.findEntityId]
157160
- [synapseutils.generate_sync_manifest][]
158-
- [synapseutils.syncToSynapse][]
161+
- [Project.sync_to_synapse][synapseclient.models.mixins.StorableContainer.sync_to_synapse]
162+
- [synapseutils.syncToSynapse][] *(deprecated)*
159163
- [Activity/Provenance](../../explanations/domain_models_of_synapse.md#activityprovenance)

synapseclient/core/utils.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -354,7 +354,7 @@ def get_properties(entity):
354354
return entity.properties if hasattr(entity, "properties") else entity
355355

356356

357-
def is_url(s):
357+
def is_url(s: str) -> bool:
358358
"""Return True if the string appears to be a valid URL."""
359359
if isinstance(s, str):
360360
try:
