This repository contains Python scripts for generating and processing configuration files that enable the transformation of MOHCCN JSON data to a JSON structure ready for CanDIG ingestion of records compliant with OMOP CDM v5.4. It currently assumes you are running this in a location that can access a vocabulary server (see Vocabulary Server Setup).
If you simply want to transform MOHCCN data from CanDIGv2 to the OMOP ingest format, see Transform MOHCCN data to OMOP.
The scripts in Run ETL pipeline demonstrate the full ETL development process of going from the MOHCCN data model to an OMOP mapping. This is useful if you want to know more about the mapping process, make changes to the mapping, or use this process as a starting point for mapping a different data model.
Before you begin, ensure you have met the following requirements:
- You have installed Python 3.7 or higher
- You have a Windows, Linux, or macOS machine
- You have set the appropriate configuration variables in the
.envfile - This script can access a vocabulary server, see Vocabulary Server Setup
To install MOHCCN-to-OMOP ETL Script Library, follow these steps. We recommend creating a virtual environment (using venv or your tool of choice). Once your environment is active:
git clone https://github.com/CanDIG/mohccn-omop-etl
cd mohccn-omop-etl
pip install -r requirements.txtThe transform_mohccn_to_omop.py script will transform a MOHCCN JSON file from CanDIG to the OMOP ingest JSON using the ETL pipeline.
This script uses the following environment variables, defined in .env. By default, these point to the example files in the config directory. Unless you want to change the how the ETL process works, the default values should be fine. See Configuration for details on environment variables.
GENERATED_VALIDATION_RULES_JSON_FILE_NAMEGENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAMEEXISTING_MAPPED_VOCAB_XLSX_FILE_NAME
It also requires access to the vocaculary server defined in VOCAB_SERVER_SEARCH_BY_CODE_URL. See see Vocabulary Server Setup
Transforms MOHCCN JSON data to OMOP CDM format. Outputs final JSON structure that is nested by dataset then donor, e.g.:
{
"datasets": [
{
"dataset": {
"id": "SYNTH_01",
"linked_records": [
{
"person": {
/* OMOP table columns */
},
"linked_records": [
{
/* OMOP table columns */
},
...
]
}
]
}
}
]
}python transform_MOHCCN_to_omop.py -i input_MOHCCN_data.jsonOptions:
-i, --input_MOHCCN_json_file- Path to source MOHCCN JSON file (required)-o, --output_final_json_file- Path to output JSON file (default: output_omop_data.json)--debug_output_log_file- Path to debug log file (optional)--debug_raw_json_node_paths_file- Path to log file for all node paths (optional)--debug_omop_json_file- Path to JSON file with full debugging info (optional)--debug_omop_validation_issues_csv_file- Path to CSV file with schema validation results
Example With Options:
python transform_MOHCCN_to_omop.py \
-i input_mohccn_data.json \
-o output_omop_data.json \
--debug_output_log_file debug_output.log \
--debug_omop_json_file debug_omop.json \
--debug_raw_json_node_paths_file raw_data_node_paths.txt
--debug_omop_validation_issues_csv_file debug_omop_validation_issues.csv- Makes API requests to vocabulary endpoint
VOCAB_SERVER_SEARCH_BY_CODE_URLfor certain MOHCCN fields, e.g. Primary Diagnosis.cancer_type_code - If command line argument --debug_output_log_file is provided, detailed debugging info is saved to file.
- If command line argument --debug_omop_json_file is provided, complete unnested transformed OMOP JSON is saved to file with skip_errors and source JSON path data.
- If command line argument --debug_raw_json_node_paths_file is provided, all distinct source JSON node paths is saved to file for debugging.
The ETL pipeline generates the MOHCCN to OMOP mappings required for transform_mohccn_to_omop.py to transform an MOHCCN file to OMOP.
Edit the .env file in the project root with the following variables. The variables are described in further detail in their respective Python scripts.
MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME - Location of MOHCCN Excel spreadsheet that describes the MOHCCN Schema field requirements, data types and permitted values (e.g. 'config/mohccn_clinical_data_modelv3-1_sep2024.xlsx'). Downloaded from https://www.marathonofhopecancercentres.ca/docs/default-source/policies-and-guidelines/cdm-v3.1/mohccn_clinical_data_modelv3-1_sep2024.xlsx?sfvrsn=331eefaa_1
GENERATION_SETTINGS_VALIDATION_RULES_JSON_FILE_NAME - Location of JSON settings for extracting validation rules and permitted values from Excel spreadsheet of MOHCCN Standard Definition. Example: 'config/mohccn_validation_rules_generation_settings.json'
GENERATED_VALIDATION_RULES_JSON_FILE_NAME - Location of JSON file generated by the script 'generate_validation_rules.py' of validation rules and permitted values extracted from MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME. Example: 'config/mohccn_validation_rules.json.
GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME - Location of CSV file generated by the script 'generate_vocabulary.py' of permitted values from MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME that require vocabulary mapping. Example: 'config/mohccn_vocabulary_unmapped.csv'
GENERATED_MAPPED_VOCAB_FILE_NAME_PREFIX - Prefix of CSV and XLSX files generated by the script 'process_vocabulary_mapping.py' of mapped vocabulary results. Example: 'mohccn_vocabulary_mapped_'
EXISTING_MAPPED_VOCAB_XLSX_FILE_NAME - Location of XLSX file of vocabulary mappings utilized by scripts 'process_vocabulary_mapping.py' and 'transform_mohccn_to_omop.py'. Example: 'config/mohccn_vocabulary_mapped_20250909_JL.xlsx'
MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME - Location of XLSX file utilized by script 'transform_mohccn_to_omop.py' that contains metadata for the transformation model from MOHCCN JSON to OMOP CDM JSON. Example: 'config/MOHCCN to OMOP CDM Map v0.4.xlsx'
GENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAME - Location of JSON file generated by script 'generate_etl_rules.py' that converts MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME into a format utilized by script 'transform_mohccn_to_omop.py'. Example: 'config/mohccn_to_omop_etl_rules.json'
VOCAB_SERVER_SEARCH_BY_TERM_URL - URL of web page used by script 'process_vocabulary_mapping.py' to map terms from GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME to Athena concept IDs. Example: 'https://techna-omop.uhndata.io/dhdp-vocab-search/index.php'
VOCAB_SERVER_SEARCH_BY_CODE_URL - URL of web page used by script 'transform_mohccn_to_omop.py' to map certain coded values in source MOHCCN JSON to Athena concept IDs. Example: 'https://techna-omop.uhndata.io/dhdp-vocab-search/search-by-code.php'
CANDIG_API_SCHEMA_YML_URL - URL to CANDIG OMOP CDM OpenAPI schema, used for validating transformed OMOP JSON
Example: 'https://raw.githubusercontent.com/CanDIG/candig-api/refs/heads/mshadbolt/ingest-schema-update/schema.yml'
This loads MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME and GENERATION_SETTINGS_VALIDATION_RULES_JSON_FILE_NAME and generates GENERATED_VALIDATION_RULES_JSON_FILE_NAME, which contains validation rules and permitted values for MOHCCN source JSON data, organized by schema and field, e.g.:
{
"Sample Registration": {
"program_id": {
"dataType": "Text",
"requiredAlways": true
},
"specimen_tissue_source": {
"dataType": "Text",
"requiredAlways": true,
"validationType": "list",
"validationInfo": [
"Blood derived - peripheral blood",
"Solid tissue",
"Bone marrow",
"Other"
]
}
}
}python generate_validation_rules.py- Maps Schema column values to first-level keys in the JSON structure
- Maps MOHCCN Clinical Field values to second-level keys
- Determines data types automatically (Text, JSON, Number, Integer)
- Processes requirement specifications (Required, Optional, Conditional)
- Extracts permitted values from Permissible Values column
- Loads permitted values from other Excel tabs when referenced
Extracts all vocabulary terms from GENERATED_VALIDATION_RULES_JSON_FILE_NAME and generates GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME that contains vocabulary terms that requires mapping.
python generate_vocabulary.py- Extracts all list-type validation rules from
GENERATED_VALIDATION_RULES_JSON_FILE_NAME - Creates individual rows for each vocabulary term to be mapped
- Includes columns for mapping to target vocabularies:
- Schema
- Field
- Original Term
- Preferred Target Classes
- Preferred Target Domains
- Preferred Target Vocabularies
GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAMEcontains columnsPreferred Target Classes,Preferred Target Domains, andPreferred Target Vocabulariesthat can be populated to help with mapping.- If
GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAMEalready exists, values forPreferred Target Classes,Preferred Target Domains, andPreferred Target Vocabulariesare loaded and preserved.
Loads vocabulary to be mapped from GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME and existing mappings from EXISTING_MAPPED_VOCAB_XLSX_FILE_NAME. Uses columns Preferred Target Classes, Preferred Target Domains, and Preferred Target Vocabularies to help map vocabulary using API endpoint VOCAB_SERVER_SEARCH_BY_TERM_URL.
Outputs a CSV file and Excel file with format of GENERATED_MAPPED_VOCAB_FILE_NAME_PREFIX plus the current date.
python process_vocabulary_mapping.py- Preserves existing mappings and notes from previous runs
- Makes API requests to vocabulary endpoint
VOCAB_SERVER_SEARCH_BY_TERM_URLfor unmapped terms, e.g. https://techna-omop.uhndata.io/dhdp-vocab-search/index.php?term=ORIGINAL_TERM&limit=10&preferredVocabs=SNOMED,LOINC&preferredDomains=&preferredClasses - Mapped vocabulary is added into the following columns:
- Mapped Term
- Concept ID: Athena concept ID
- Concept Code: Vocabulary-specific concept code
- Term Type (Concept/Synonym)
- Match Type (Exact*, Inexact*, etc.)
- Class
- Domain
- Vocabulary
- Mapped By: For new mappings, value is set to [Your Name] for easy search and replace
- Approved By
- Review URL: URL that links to Vocabulary Search GUI for reviewing mapping results
- Notes
- Generates timestamped output files to preserve version history
- "Exact*" - Multiple exact matches found
- "Inexact*" - Multiple inexact matches found and first result is inexact
- "Exact" - Single exact match
- "Inexact" - Single inexact match
Loads Excel MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME and generates GENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAME into a JSON structure representing the transformation model from MOHCCN JSON to OMOP CDM JSON, e.g.:
{
"JSON Level 1": {
"Data Group 1": {
"OMOP Table 1": [
{
"Target Field": "person_id",
"Instruction": "Get Unique ID",
"MOHCCN Source Schema": "Donor",
"Source Field": "submitter_donor_id",
"Value": "OMOP Patient Person",
"Requirement Rule": "Skip Record"
}
],
"OMOP Table 2": [ /* more rules */ ]
}
}
}python generate_etl_rules.py- Organizes rules hierarchically by:
- Source JSON Level
- DHDP Data Group
- OMOP Table
Sets a fixed value for an OMOP column regardless of source data. MOHCCN Source Schema and Source Field columns must be set to 'N/A'.
Sets a fixed value only if the source value is not empty. MOHCCN Source Schema and Source Field columns must be set to 'N/A'.
Copies a value directly from Value column. MOHCCN Source Schema and Source Field columns must be set to 'N/A'. Global Variables are identified by '{global_var_name}' and replaced accordingly.
Generates a unique identifier map to unambigiously identify a generated PK value by a combination:
- DHDP Data Group, which encapsulated both the target OMOP table as well as unique hierarchical context in source MOHCCN JSON, e.g. 'OMOP Primary Dx Initial Diagnosis Measurement'
- Source schema and field, e.g. Primary Diagnosis.submitter_primary_diagnosis_id
Generates a unique ID based on a specific JSON path in the source data.
Retrieves a previously generated unique ID, potentially using global variables, where the Value column is the associated DHDP Data Group that identifies the hiearchical context.
Retrieves a unique ID without using global variables.
Retrieves a unique ID from a specific JSON path.
Looks up a concept ID from vocabulary mappings if source MOHCCN value is a textual term, e.g. Primary Diagnosis.basis_of_diagnosis.
Looks up a concept ID from vocabulary API endpoint if source MOHCCN value is a code, e.g. Primary Diagnosis.cancer_type_code.
Calculates a year value from a date interval JSON object.
Calculates a month value from a date interval JSON object.
Calculates a day value from a date interval JSON object.
Calculates a full date from a date interval JSON object.
Concatenates multiple values using a separator (default: |).
Resets global variables used for cross-record data sharing.
Saves a value to global variables for use in subsequent transformations.
The ETL pipeline requires access to an OMOP Vocabulary Server, both to access the standard OMOP Athena vocabularies and to make modifications to the standard vocabulary.
We provide a set of docker compose files in vocabulary_server_setup for setting up the vocabulary server. See vocabulary_server_setup/readme for more details on setting up the server.
Add information about what tests are available for the software and how to run them
See our CONTRIBUTING page for details on how to contribute to this project.
Thanks to the following people who have contributed to this project:
All CanDIG code is licensed with LGPL-3.0 license.