Commit 875c8f5

[SYNPY-1800] Added new sync_to_synapse method (#1353)
* added new sync_to_synapse method
* improved deprecation
* created copy of SyncUploader in manifest.py
* added tests
* fixed unit tests
* fix tests to work on windows
* fixed typo bug
* fix bug in tutorial
* Lingling's suggestions
* turn send message off in unit test
* fix test fixture
* fix docstrings
* fix inconsistent default
* fix typing
* added notes about change in column name
* tests now have send_messages = False
* tutorial now has send_messages = False
* minor fixes
* fixed line numbers
* fixed docs
* reverted notes back to !!! style
* add note for adding model methods
* fix hierarchies
* fix tutorial
* update docs
* update docs
* remove creation of file hierarchy
* various fixes
* fix docstrings
* refactor manifest functions for readability
* various fixes
* fix test and added two others
* various fixes
* made functions more readable
* fix erroneous change to upload
* removed unneeded test
* added summary to docstring
* added unit tests
* remove comments
* fix example formatting
* improved file cleanup
* fix typos
* remove unneeded global
* changed tests so that scheduling File cleanup happens right after files are created in Synapse
1 parent 520760b commit 875c8f5

18 files changed

Lines changed: 4419 additions & 73 deletions

docs/CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ Reference markdown files use `::: synapseclient.ClassName` syntax to trigger aut
 - `filters: ["!^_", "!to_synapse_request", "!fill_from_dict"]` — private members, `to_synapse_request()`, and `fill_from_dict()` are excluded from docs
 - `inherited_members: true` — shows mixin methods on inheriting classes
 - Member lists are explicit — each reference page specifies which methods to document
+- When adding a new public method to a model class, add it to the `members:` list in the corresponding reference pages (`docs/reference/experimental/sync/` and `docs/reference/experimental/async/`). Without this, mkdocstrings won't generate an anchor and cross-references like `[synapseclient.models.ClassName.method]` will break.
 
 ### Anchor links for cross-referencing
 Pattern: `[](){ #reference-anchor }` in reference pages. Tutorials link to reference via `[API Reference][project-reference-sync]`. Explicit type hints use: `[syn.login][synapseclient.Synapse.login]`.

docs/explanations/manifest_csv.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Manifest CSV

The manifest is a CSV file with file locations and metadata used to bulk upload and download files in Synapse. It is the standard manifest format used by `Project.sync_from_synapse`, `Project.sync_to_synapse`, `Folder.sync_from_synapse`, `Folder.sync_to_synapse`, the Synapse UI download cart, and the `synapse get-download-list` CLI command.

!!! note
    This CSV manifest replaces the legacy TSV manifest produced by `synapseutils.syncFromSynapse`. The `syncFromSynapse` and `syncToSynapse` utility functions are deprecated and will be removed in v5.0.0. Use `Project.sync_from_synapse` / `Folder.sync_from_synapse` and `Project.sync_to_synapse` / `Folder.sync_to_synapse` instead. See the [legacy TSV manifest documentation](manifest_tsv.md) for details on the old format.

## Manifest file format

The format of the manifest file is a comma-separated value (CSV) file with one row per file and columns describing the file. The minimum required columns for uploading are **path** and **parentId**, where `path` is the local file path and `parentId` is the Synapse ID of the project or folder where the file is uploaded to. Values that contain commas are automatically quoted (e.g., `"hello, world"`).
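
As a rough illustration of the format described above, the sketch below writes a two-row manifest with Python's standard `csv` module. The paths, Synapse IDs, and the `comment` annotation column are placeholders invented for this example, and the snippet is not the client's own manifest writer. It only shows that a standard CSV writer quotes a value containing a comma.

```python
import csv
import io

# Hypothetical rows: paths, Synapse IDs, and the `comment` column are
# placeholders for illustration, not values from a real project.
rows = [
    {"path": "/path/to/local/file.txt", "parentId": "syn1235", "comment": "hello, world"},
    {"path": "/path/to/local/other.txt", "parentId": "syn1235", "comment": "no comma here"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["path", "parentId", "comment"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

# The default QUOTE_MINIMAL dialect quotes only fields that need it,
# such as the value containing a comma.
manifest_text = buffer.getvalue()
print(manifest_text)
```

Writing to a file instead of a `StringIO` buffer works the same way; open the file with `newline=""` as the `csv` module documentation recommends.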

### Required fields for upload

| Field | Meaning | Example |
|----------|----------------------------|-------------------------|
| path | local file path or URL | /path/to/local/file.txt |
| parentId | Synapse ID of parent | syn1235 |

!!! note
    The legacy TSV manifest used a column named `parent`. The CSV manifest uses `parentId` instead, which is consistent with the Synapse REST API field name. If you are migrating an existing TSV manifest to CSV, rename the `parent` column to `parentId`.
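
The migration described in the note can be sketched with the standard library alone. The helper name `tsv_manifest_to_csv` and the sample row are hypothetical. The function only changes the delimiter and renames `parent` to `parentId`; any other differences between the two formats would need separate handling.

```python
import csv
import io


def tsv_manifest_to_csv(tsv_text: str) -> str:
    """Convert legacy TSV manifest text to CSV, renaming `parent` to `parentId`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Keep the column order, swapping only the renamed column.
    fieldnames = [
        "parentId" if name == "parent" else name
        for name in (reader.fieldnames or [])
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if "parent" in row:
            row["parentId"] = row.pop("parent")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical legacy manifest with one annotation that contains a comma.
legacy = "path\tparent\tannot1\n/path/file1.txt\tsyn1243\taaaa, bbbb\n"
converted = tsv_manifest_to_csv(legacy)
print(converted)
```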

### Standard fields

These columns are recognized by `sync_to_synapse` and have specific meaning. Any of these columns may be present in the manifest but only `path` and `parentId` are required for upload.
Each example shows what a single row would contain in that column. For instance, `syn1235;/path/to_local/file.txt` in the `used` column means that both `syn1235` and `/path/to_local/file.txt` are recorded as items used to generate the file. A single item, such as `syn1234`, may also be given on its own.

| Field | Meaning | Example |
|---------------------|--------------------------------------------|----------------------------------------------|
| path | local file path or URL | /path/to/local/file.txt |
| parentId | Synapse ID of parent container | syn1235 |
| ID | Synapse entity ID | syn2345 |
| name | name of file in Synapse | Example_file |
| synapseStore | whether to upload the file | True |
| contentType | content type of file to overwrite defaults | text/html |
| forceVersion | whether to update version | False |
| activityName | name of activity in provenance | Ran normalization |
| activityDescription | text description of what was done | Ran algorithm xyz with parameters... |
| used | list of items used to generate file | syn1235;/path/to_local/file.txt |
| executed | list of items executed | https://github.org/;/path/to_local/code.py |
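
To make the `used` and `executed` examples concrete, here is a minimal sketch that splits such a cell into its items. The semicolon separator is taken from the examples in the table; the helper name `split_provenance` is hypothetical, and the client's own parsing may differ (for instance in how it treats whitespace).

```python
from typing import List


def split_provenance(cell: str) -> List[str]:
    """Split a `used` or `executed` cell into individual items.

    Assumption: items are separated by semicolons, as in the table examples.
    Empty items and surrounding whitespace are dropped.
    """
    return [item.strip() for item in cell.split(";") if item.strip()]


two_items = split_provenance("syn1235;/path/to_local/file.txt")
one_item = split_provenance("syn1234")
no_items = split_provenance("")
print(two_items, one_item, no_items)
```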

### Metadata fields (ignored during upload)

These columns are present in manifests generated by the Synapse UI download cart and `synapse get-download-list` CLI. They are ignored by `sync_to_synapse` and are **not** treated as annotations.

| Field | Meaning |
|-------------------|-------------------------------|
| error | any error in downloading file |
| versionNumber | version of the file |
| dataFileSizeBytes | size of the file in bytes |
| createdBy | user who created the file |
| createdOn | date the file was created |
| modifiedBy | user who last modified |
| modifiedOn | date last modified |
| synapseURL | URL to the file in Synapse |
| dataFileMD5Hex | MD5 hash of the file |

### Annotations

Any columns that are not in the standard or metadata fields described above will be interpreted as annotations of the file.

Adding annotations to each row:

| path | parentId | annot1 | annot2 | annot3 | annot4 | annot5 |
| --- | --- | --- | --- | --- | --- | --- |
| /path/file1.txt | syn1243 | bar | 3.1415 | "aaaa, bbbb" | "[14,27,30]" | "Annotation, with a comma" |
| /path/file2.txt | syn12433 | baz | 2.71 | value_1 | "[1,2,3]" | test 123 |
| /path/file3.txt | syn12455 | zzz | 3.52 | value_3 | "[42,56,77]" | a single annotation |
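
The rule above (anything that is neither a standard nor a metadata column is an annotation) can be sketched as a simple set lookup. The two sets are copied from the tables in this document; the helper name `annotation_columns` and the exact, case-sensitive matching are assumptions made for illustration, not a description of the client's internals.

```python
from typing import Iterable, List

# Column names copied from the "Standard fields" table.
STANDARD_FIELDS = {
    "path", "parentId", "ID", "name", "synapseStore", "contentType",
    "forceVersion", "activityName", "activityDescription", "used", "executed",
}

# Column names copied from the "Metadata fields" table.
METADATA_FIELDS = {
    "error", "versionNumber", "dataFileSizeBytes", "createdBy", "createdOn",
    "modifiedBy", "modifiedOn", "synapseURL", "dataFileMD5Hex",
}


def annotation_columns(header: Iterable[str]) -> List[str]:
    """Return the columns that would be treated as annotations.

    Assumption: names are matched exactly (case-sensitive).
    """
    return [
        column
        for column in header
        if column not in STANDARD_FIELDS and column not in METADATA_FIELDS
    ]


header = ["path", "parentId", "annot1", "annot2", "createdOn", "used"]
annotations = annotation_columns(header)
print(annotations)
```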

#### Multiple values of annotations per key

Multiple values for a single annotation should be used sparingly, as they make the data more difficult to manage. However, they are supported.

**Annotations can be comma `,` separated lists surrounded by brackets `[]`.**

Because the manifest is a CSV file, multi-value annotations that contain commas are automatically quoted. For example, `[a,b,c]` will appear in the CSV as `"[a,b,c]"`.

This is an annotation with 3 values:

| path | parentId | annot1 |
|-----------------|----------|--------------|
| /path/file1.txt | syn1243 | "[a,b,c]" |
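
The quoting and bracket conventions above can be illustrated as follows. A CSV reader removes the surrounding quotes, leaving `[a,b,c]`, which is then split on commas. The helper `parse_annotation_cell` is hypothetical and keeps every value as a string; the client may additionally convert types or handle quoted items inside the brackets.

```python
import csv
import io
from typing import List


def parse_annotation_cell(cell: str) -> List[str]:
    """Split a bracketed, comma-separated cell into values.

    Assumption: a cell without surrounding brackets is a single value.
    """
    text = cell.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1]
        return [value.strip() for value in inner.split(",") if value.strip()]
    return [text]


manifest_text = 'path,parentId,annot1\n/path/file1.txt,syn1243,"[a,b,c]"\n'
row = next(csv.DictReader(io.StringIO(manifest_text)))

# The csv module has already stripped the quotes, so the cell is "[a,b,c]".
values = parse_annotation_cell(row["annot1"])
print(values)
```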

### Dates in the manifest file

Dates within the manifest file will always be written as [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format in UTC without milliseconds. For example: `2023-12-20T16:55:08Z`.

Dates can be written in other formats specified in ISO 8601 and they will be recognized. However, `sync_from_synapse` will always write dates in the UTC format specified above. For example, you may want to specify a datetime at a specific timezone like `2023-12-20 23:55:08-07:00` and this will be recognized as a valid datetime.
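
The relationship between the two date forms above can be checked with the standard library. This is only a sketch of the conversion, not the client's implementation: the helper name `to_manifest_utc` is hypothetical, and treating a value without an offset as UTC is an assumption. Note that `datetime.fromisoformat` accepts a trailing `Z` only from Python 3.11 onward.

```python
from datetime import datetime, timezone


def to_manifest_utc(value: str) -> str:
    """Normalize an ISO 8601 datetime string to UTC without milliseconds."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a datetime without an offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    in_utc = parsed.astimezone(timezone.utc).replace(microsecond=0)
    return in_utc.strftime("%Y-%m-%dT%H:%M:%SZ")


normalized = to_manifest_utc("2023-12-20 23:55:08-07:00")
print(normalized)  # 2023-12-21T06:55:08Z
```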

## Manifest sources

The CSV manifest format is shared across multiple tools:

| Source | Filename |
|----------------------------------------------------------|---------------------------------|
| `Project.sync_from_synapse` / `Folder.sync_from_synapse` | manifest.csv |
| Synapse UI download cart | manifest.csv |
| CLI `synapse get-download-list` | `manifest_<timestamp>.csv` |

A manifest generated by any of these sources can be used as input to `sync_to_synapse`, provided the `path` column is present with valid local file paths. Manifests from the Synapse UI do not include a `path` column by default, so users must add it before uploading.
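
One possible way to add the missing `path` column is sketched below. It assumes, hypothetically, that the manifest has a `name` column holding each file's name and that all files sit in a single local directory; neither assumption is confirmed by this document, so adjust the lookup to match the actual download layout. The helper name `add_path_column` is also invented for this example.

```python
import csv
import io
import os


def add_path_column(manifest_text: str, download_dir: str) -> str:
    """Return manifest text with a `path` column prepended.

    Assumptions: each row has a `name` column with the local file name,
    and every file lives directly inside `download_dir`.
    """
    reader = csv.DictReader(io.StringIO(manifest_text))
    fieldnames = ["path"] + [
        name for name in (reader.fieldnames or []) if name != "path"
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["path"] = os.path.join(download_dir, row["name"])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical manifest content without a `path` column.
ui_manifest = "ID,name,parentId\nsyn5001,file1.txt,syn1243\n"
with_paths = add_path_column(ui_manifest, "downloads")
print(with_paths)
```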

### Example manifest file

| path | parentId | ID | name | annot1 | annot2 | collection_date | used | executed |
|-----------------|----------|---------|-----------|--------|--------|---------------------------|--------------------------|------------------------------|
| /path/file1.txt | syn1243 | syn5001 | file1.txt | bar | 3.1415 | 2023-12-04T07:00:00Z | syn124;/path/file2.txt | https://github.org/foo/bar |
| /path/file2.txt | syn12433 | syn5002 | file2.txt | baz | 2.71 | 2001-01-01T08:00:00Z | | https://github.org/foo/baz |
| /path/file3.txt | syn12455 | syn5003 | file3.txt | zzz | 3.52 | 2023-12-04T07:00:00Z | | https://github.org/foo/zzz |

## References

- [Project.sync_from_synapse][synapseclient.models.Project.sync_from_synapse]
- [Project.sync_to_synapse][synapseclient.models.Project.sync_to_synapse]
- [Folder.sync_from_synapse][synapseclient.models.Folder.sync_from_synapse]
- [Folder.sync_to_synapse][synapseclient.models.Folder.sync_to_synapse]
- [Manifest TSV (legacy)](manifest_tsv.md)
- [Managing custom metadata at scale](https://help.synapse.org/docs/Managing-Custom-Metadata-at-Scale.2004254976.html#ManagingCustomMetadataatScale-BatchUploadFileswithAnnotations)

docs/explanations/manifest_tsv.md

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,9 @@
-# Manifest
+# Manifest TSV (Legacy)
 The manifest is a tsv file with file locations and metadata to be pushed to Synapse. The purpose is to allow bulk actions through a TSV without the need to manually execute commands for every requested action.
 
+!!! warning "Deprecated"
+    This TSV manifest format is produced by [synapseutils.syncFromSynapse][] and consumed by [synapseutils.syncToSynapse][], both of which are deprecated and will be removed in v5.0.0. Use `Project.sync_from_synapse` / `Folder.sync_from_synapse` and `Project.sync_to_synapse` / `Folder.sync_to_synapse` instead, which use the [CSV manifest format](manifest_csv.md).
+
 ## Manifest file format
 
 The format of the manifest file is a tab delimited file with one row per file to upload and columns describing the file. The minimum required columns are **path** and **parent** where path is the local file path and parent is the Synapse Id of the project or folder where the file is uploaded to.
@@ -20,6 +23,9 @@ Any additional columns will be added as annotations.
 | path | local file path or URL | /path/to/local/file.txt |
 | parent | synapse id | syn1235 |
 
+!!! note "Column renamed in CSV format"
+    The CSV manifest format uses `parentId` instead of `parent`. If you are migrating to the new [CSV manifest format](manifest_csv.md), rename the `parent` column to `parentId`.
+
 ### Common fields:
 
 | Field | Meaning | Example |

docs/reference/experimental/async/folder.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ at your own risk.
 - copy_async
 - walk_async
 - sync_from_synapse_async
+- sync_to_synapse_async
 - flatten_file_list
 - map_directory_to_all_contained_files
 - get_permissions_async

docs/reference/experimental/async/project.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ at your own risk.
 - delete_async
 - walk_async
 - sync_from_synapse_async
+- sync_to_synapse_async
 - flatten_file_list
 - map_directory_to_all_contained_files
 - get_permissions_async

docs/reference/experimental/sync/folder.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ at your own risk.
 - copy
 - walk
 - sync_from_synapse
+- sync_to_synapse
 - flatten_file_list
 - map_directory_to_all_contained_files
 - get_permissions

docs/reference/experimental/sync/project.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ at your own risk.
 - delete
 - walk
 - sync_from_synapse
+- sync_to_synapse
 - flatten_file_list
 - map_directory_to_all_contained_files
 - get_permissions

docs/tutorials/python/tutorial_scripts/upload_data_in_bulk.py

Lines changed: 29 additions & 31 deletions
@@ -2,62 +2,64 @@
 Here is where you'll find the code for the uploading data in bulk tutorial.
 """
 
-import os
+import pandas as pd
 
 import synapseclient
-import synapseutils
+from synapseclient.models import Project
 
 syn = synapseclient.Synapse()
 syn.login()
 
-# Create some constants to store the paths to the data
-DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
-PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.tsv"))
+# Step 1: Create some constants to store the paths to the data
+DIRECTORY_FOR_MY_PROJECT = "test_folder"  # This should exist with your files in it
+PATH_TO_MANIFEST_FILE = "test_manifest.csv"  # This doesn't need to exist yet
+SYNAPSE_PROJECT_ID = ""  # Put your Synapse project ID here. This is the project where you want to upload your data.
 
-# Step 1: Let's find the synapse ID of our project:
-my_project_id = syn.findEntityId(
-    name="My uniquely named project about Alzheimer's Disease"
-)
+# TODO switch to using new version of synapseutils/sync.py.generate_sync_manifest
+# https://sagebionetworks.jira.com/browse/SYNPY-1809
 
-# Step 2: Create a manifest TSV file to upload data in bulk
+# Step 2: Create a manifest CSV file with the paths to the files and their parent folders
 # Note: When this command is run it will re-create your directory structure within
 # Synapse. Be aware of this before running this command.
 # If folders with the exact names already exists in Synapse, those folders will be used.
-synapseutils.generate_sync_manifest(
+
+
+# old function generates a TSV
+from synapseutils import generate_sync_manifest
+
+generate_sync_manifest(
     syn=syn,
     directory_path=DIRECTORY_FOR_MY_PROJECT,
-    parent_id=my_project_id,
+    parent_id=SYNAPSE_PROJECT_ID,
     manifest_path=PATH_TO_MANIFEST_FILE,
 )
+# reformat the manifest file to work with sync_to_synapse
+manifest_df = pd.read_csv(PATH_TO_MANIFEST_FILE, sep="\t")
+manifest_df.rename(columns={"parent": "parentId"}, inplace=True)
+manifest_df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
 # Step 3: After generating the manifest file, we can upload the data in bulk
-synapseutils.syncToSynapse(
-    syn=syn, manifestFile=PATH_TO_MANIFEST_FILE, sendMessages=False
-)
+project = Project(id=SYNAPSE_PROJECT_ID)
+project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)
 
 # Step 4: Let's add an annotation to our manifest file
 # Pandas is a powerful data manipulation library in Python, although it is not required
 # for this tutorial, it is used here to demonstrate how you can manipulate the manifest
 # file before uploading it to Synapse.
-import pandas as pd
 
-# Read TSV file into a pandas DataFrame
-df = pd.read_csv(PATH_TO_MANIFEST_FILE, sep="\t")
+# Read CSV file into a pandas DataFrame
+df = pd.read_csv(PATH_TO_MANIFEST_FILE)
 
 # Add a new column to the DataFrame
 df["species"] = "Homo sapiens"
 
 # Write the DataFrame back to the manifest file
-df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
+df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
-synapseutils.syncToSynapse(
-    syn=syn,
-    manifestFile=PATH_TO_MANIFEST_FILE,
-    sendMessages=False,
-)
+project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)
 
 # Step 5: Let's create an Activity/Provenance
-# First let's find the row in the TSV we want to update. This code finds the row number
+# First let's find the row in the CSV we want to update. This code finds the row number
 # that we would like to update.
 row_index = df[
     df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
@@ -81,10 +83,6 @@
 )
 
 # Write the DataFrame back to the manifest file
-df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
+df.to_csv(PATH_TO_MANIFEST_FILE, index=False)
 
-synapseutils.syncToSynapse(
-    syn=syn,
-    manifestFile=PATH_TO_MANIFEST_FILE,
-    sendMessages=False,
-)
+project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)
