Provenance is a concept describing the origin of something. In Synapse, it is used to describe the connections between the workflow steps used to create a particular file or set of results. Data analysis often involves multiple steps to go from a raw data file to a finished analysis. Synapse's provenance tools allow users to keep track of each step involved in an analysis and share those steps with other users.
The model Synapse uses for provenance is based on the W3C provenance specification: items are derived from an activity, which records the components that were used and the components that were executed. Think of the used items as input files and the executed items as software or code. Both used and executed items can be entities stored in Synapse or URLs, such as a link to a GitHub commit or a link to a specific version of a software tool.
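To make the used/executed split concrete, here is a minimal standard-library sketch of the model described above. The class names (`EntityRef`, `UrlRef`, `ProvenanceRecord`) and the example URL are hypothetical stand-ins chosen for illustration. They are not the synapseclient classes, and nothing here contacts Synapse.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class EntityRef:
    """Hypothetical stand-in for an item stored in Synapse (ID plus optional version)."""
    target_id: str
    target_version_number: Optional[int] = None


@dataclass
class UrlRef:
    """Hypothetical stand-in for an item that lives at a URL."""
    name: str
    url: str


Reference = Union[EntityRef, UrlRef]


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one step: what was used (inputs) and executed (code)."""
    name: str
    used: List[Reference] = field(default_factory=list)
    executed: List[Reference] = field(default_factory=list)


# One analysis step: an input file held in Synapse, and a script held at a URL.
# The Synapse ID and the URL are placeholders.
qc_step = ProvenanceRecord(
    name="Quality Control Analysis",
    used=[EntityRef(target_id="syn12345678", target_version_number=1)],
    executed=[UrlRef(name="QC script", url="https://example.org/qc_script.py")],
)

for ref in qc_step.used:
    print("Used:", ref)
for ref in qc_step.executed:
    print("Executed:", ref)
```

Keeping `used` and `executed` as separate lists mirrors the distinction in the text: inputs go in one, code and software in the other, and either list may mix Synapse items with URLs.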
Dive into Activity/Provenance further here
In this tutorial you will:
- Add a new Activity to your File
- Add a new Activity to a specific version of your File
- Print stored activities on your File
- Delete an activity

To follow this tutorial, you need a Project containing at least one File stored in a Folder named biospecimen_experiment_1.
{!docs/tutorials/python/tutorial_scripts/activity.py!lines=6-24}

An Activity captures what was used (input data and reference URLs) and executed (code and software) to produce a file. Here we record a QC pipeline run on the biospecimen data:
{!docs/tutorials/python/tutorial_scripts/activity.py!lines=26-53}

You'll notice the output looks like:
Stored file: fileA.txt (version 1) with activity: Quality Control Analysis
Each time you store an updated file, Synapse creates a new version. You can associate a distinct activity with each version to capture the full history of how the data evolved. Here we record a downstream analysis step that used the QC-passed data from version 1:
{!docs/tutorials/python/tutorial_scripts/activity.py!lines=55-88}

You'll notice the output looks like:
Stored activity 'Downstream Analysis' on file fileA.txt (version 1)
Use Activity.from_parent() to retrieve the provenance for any version of a file. Pass a parent_version_number to retrieve the activity for a specific older version:
{!docs/tutorials/python/tutorial_scripts/activity.py!lines=90-108}

You'll notice the output looks like:
Activity on latest version (v1):
Name: Downstream Analysis
Description: Downstream analysis of QC-passed biospecimen samples.
Used: UsedURL(name='Seurat v5.0.0', url='https://github.com/satijalab/seurat/releases/tag/v5.0.0')
Used: UsedEntity(target_id='syn12345678', target_version_number=1)
Executed: UsedURL(name='Downstream Analysis Script', url='https://github.com/Sage-Bionetworks/analysis-scripts/blob/v1.0/downstream_analysis.py')
Activity on version 1:
Name: Quality Control Analysis
Description: Initial QC analysis of biospecimen data using the FastQC pipeline.
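The lookup described above (no version number gives the latest version, an explicit number gives that version) can be pictured with a small in-memory sketch. The dictionary, the helper `activity_for`, and the version numbers are hypothetical and purely illustrative. This is not how synapseclient stores or fetches activities.

```python
from typing import Dict, Optional

# Hypothetical history for one file: version number -> name of its activity.
activity_by_version: Dict[int, str] = {
    1: "Quality Control Analysis",
    2: "Downstream Analysis",
}


def activity_for(history: Dict[int, str], version: Optional[int] = None) -> Optional[str]:
    """Return the activity for `version`, or for the latest version when it is None."""
    if not history:
        return None
    if version is None:
        version = max(history)  # latest version = highest version number
    return history.get(version)  # None when that version has no activity


print("Latest:", activity_for(activity_by_version))
print("Version 1:", activity_for(activity_by_version, version=1))
```

Returning `None` for a version without an activity matches the idea that provenance is optional per version.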
Deleting an activity disassociates it from the entity. Once the activity is no longer referenced by any entity, Synapse removes it entirely. If the same activity is shared across multiple entities, you will need to call Activity.delete() on each of them:
{!docs/tutorials/python/tutorial_scripts/activity.py!lines=110-118}

You'll notice the output looks like:
Deleted activity from: fileA.txt (version 1)
Activity after deletion: None
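The deletion behavior described above (disassociate first, remove only when no entity still references the activity) can be modeled with a small in-memory sketch. `ProvenanceStore` and the IDs are hypothetical. The sketch illustrates the rule and says nothing about how Synapse implements it.

```python
from typing import Dict, Set


class ProvenanceStore:
    """Hypothetical in-memory model of entity-to-activity associations."""

    def __init__(self) -> None:
        self.activity_of: Dict[str, str] = {}  # entity id -> activity id
        self.activities: Set[str] = set()      # activities that still exist

    def attach(self, entity_id: str, activity_id: str) -> None:
        """Associate an activity with an entity."""
        self.activities.add(activity_id)
        self.activity_of[entity_id] = activity_id

    def delete(self, entity_id: str) -> None:
        """Disassociate the entity; drop the activity once nothing references it."""
        activity_id = self.activity_of.pop(entity_id, None)
        if activity_id is not None and activity_id not in self.activity_of.values():
            self.activities.discard(activity_id)


store = ProvenanceStore()
store.attach("fileA", "qc-activity")
store.attach("fileB", "qc-activity")  # the same activity shared by two entities

store.delete("fileA")
print("qc-activity" in store.activities)  # True: fileB still references it

store.delete("fileB")
print("qc-activity" in store.activities)  # False: no references remain
```

This is why a shared activity needs one delete call per entity: each call removes a single association, and the activity disappears only after the last one.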
Source code for this tutorial:

{!docs/tutorials/python/tutorial_scripts/activity.py!}

- [Activity][synapseclient.models.Activity]
- [UsedEntity][synapseclient.models.UsedEntity]
- [UsedURL][synapseclient.models.UsedURL]
- [File][file-reference-sync]
- [syn.login][synapseclient.Synapse.login]