Skip to content

CanDIG/mohccn-omop-etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

80 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

MOHCCN-to-OMOP ETL Script Library

This repository contains Python scripts for generating and processing configuration files that enable the transformation of MOHCCN JSON data to a JSON structure ready for CanDIG ingestion of records compliant with OMOP CDM v5.4. It currently assumes you are running this in a location that can access a vocabulary server (see Vocabulary Server Setup).

If you simply want to transform MOHCCN data from CanDIGv2 to the OMOP ingest format, see Transform MOHCCN data to OMOP.

The scripts in Run ETL pipeline demonstrate the full ETL development process of going from the MOHCCN data model to an OMOP mapping. This is useful if you want to know more about the mapping process, make changes to the mapping, or use this process as a starting point for mapping a different data model.

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed Python 3.7 or higher
  • You have a Windows, Linux, or macOS machine
  • You have set the appropriate configuration variables in the .env file
  • This script can access a vocabulary server, see Vocabulary Server Setup

Installing MOHCCN-to-OMOP ETL Script Library

To install MOHCCN-to-OMOP ETL Script Library, follow these steps. We recommend creating a virtual environment (using venv or your tool of choice). Once your environment is active:

git clone https://github.com/CanDIG/mohccn-omop-etl
cd mohccn-omop-etl
pip install -r requirements.txt

Transform MOHCCN data to OMOP

The transform_mohccn_to_omop.py script will transform a MOHCCN JSON file from CanDIG to the OMOP ingest JSON using the ETL pipeline.

Configuration

This script uses the following environment variables, defined in .env. By default, these point to the example files in the config directory. Unless you want to change the how the ETL process works, the default values should be fine. See Configuration for details on environment variables.

  • GENERATED_VALIDATION_RULES_JSON_FILE_NAME
  • GENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAME
  • EXISTING_MAPPED_VOCAB_XLSX_FILE_NAME

It also requires access to the vocaculary server defined in VOCAB_SERVER_SEARCH_BY_CODE_URL. See see Vocabulary Server Setup

Purpose

Transforms MOHCCN JSON data to OMOP CDM format. Outputs final JSON structure that is nested by dataset then donor, e.g.:

{
    "datasets": [
        {
            "dataset": {
                "id": "SYNTH_01",
                "linked_records": [
                    {
                        "person": {
                            /* OMOP table columns */
                        },
                        "linked_records": [
                            {
                                /* OMOP table columns */
                            },
							...
						]
					}
				]
			}
		}
	]
}

Usage

python transform_MOHCCN_to_omop.py -i input_MOHCCN_data.json

Options:

  • -i, --input_MOHCCN_json_file - Path to source MOHCCN JSON file (required)
  • -o, --output_final_json_file - Path to output JSON file (default: output_omop_data.json)
  • --debug_output_log_file - Path to debug log file (optional)
  • --debug_raw_json_node_paths_file - Path to log file for all node paths (optional)
  • --debug_omop_json_file - Path to JSON file with full debugging info (optional)
  • --debug_omop_validation_issues_csv_file - Path to CSV file with schema validation results

Example With Options:

python transform_MOHCCN_to_omop.py \
  -i input_mohccn_data.json \
  -o output_omop_data.json \
  --debug_output_log_file debug_output.log \
  --debug_omop_json_file debug_omop.json \
  --debug_raw_json_node_paths_file raw_data_node_paths.txt
  --debug_omop_validation_issues_csv_file debug_omop_validation_issues.csv

Features

  • Makes API requests to vocabulary endpoint VOCAB_SERVER_SEARCH_BY_CODE_URL for certain MOHCCN fields, e.g. Primary Diagnosis.cancer_type_code
  • If command line argument --debug_output_log_file is provided, detailed debugging info is saved to file.
  • If command line argument --debug_omop_json_file is provided, complete unnested transformed OMOP JSON is saved to file with skip_errors and source JSON path data.
  • If command line argument --debug_raw_json_node_paths_file is provided, all distinct source JSON node paths is saved to file for debugging.

Run ETL pipeline

The ETL pipeline generates the MOHCCN to OMOP mappings required for transform_mohccn_to_omop.py to transform an MOHCCN file to OMOP.

Configuration

Edit the .env file in the project root with the following variables. The variables are described in further detail in their respective Python scripts.

MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME - Location of MOHCCN Excel spreadsheet that describes the MOHCCN Schema field requirements, data types and permitted values (e.g. 'config/mohccn_clinical_data_modelv3-1_sep2024.xlsx'). Downloaded from https://www.marathonofhopecancercentres.ca/docs/default-source/policies-and-guidelines/cdm-v3.1/mohccn_clinical_data_modelv3-1_sep2024.xlsx?sfvrsn=331eefaa_1

GENERATION_SETTINGS_VALIDATION_RULES_JSON_FILE_NAME - Location of JSON settings for extracting validation rules and permitted values from Excel spreadsheet of MOHCCN Standard Definition. Example: 'config/mohccn_validation_rules_generation_settings.json'

GENERATED_VALIDATION_RULES_JSON_FILE_NAME - Location of JSON file generated by the script 'generate_validation_rules.py' of validation rules and permitted values extracted from MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME. Example: 'config/mohccn_validation_rules.json.

GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME - Location of CSV file generated by the script 'generate_vocabulary.py' of permitted values from MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME that require vocabulary mapping. Example: 'config/mohccn_vocabulary_unmapped.csv'

GENERATED_MAPPED_VOCAB_FILE_NAME_PREFIX - Prefix of CSV and XLSX files generated by the script 'process_vocabulary_mapping.py' of mapped vocabulary results. Example: 'mohccn_vocabulary_mapped_'

EXISTING_MAPPED_VOCAB_XLSX_FILE_NAME - Location of XLSX file of vocabulary mappings utilized by scripts 'process_vocabulary_mapping.py' and 'transform_mohccn_to_omop.py'. Example: 'config/mohccn_vocabulary_mapped_20250909_JL.xlsx'

MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME - Location of XLSX file utilized by script 'transform_mohccn_to_omop.py' that contains metadata for the transformation model from MOHCCN JSON to OMOP CDM JSON. Example: 'config/MOHCCN to OMOP CDM Map v0.4.xlsx'

GENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAME - Location of JSON file generated by script 'generate_etl_rules.py' that converts MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME into a format utilized by script 'transform_mohccn_to_omop.py'. Example: 'config/mohccn_to_omop_etl_rules.json'

VOCAB_SERVER_SEARCH_BY_TERM_URL - URL of web page used by script 'process_vocabulary_mapping.py' to map terms from GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME to Athena concept IDs. Example: 'https://techna-omop.uhndata.io/dhdp-vocab-search/index.php'

VOCAB_SERVER_SEARCH_BY_CODE_URL - URL of web page used by script 'transform_mohccn_to_omop.py' to map certain coded values in source MOHCCN JSON to Athena concept IDs. Example: 'https://techna-omop.uhndata.io/dhdp-vocab-search/search-by-code.php'

CANDIG_API_SCHEMA_YML_URL - URL to CANDIG OMOP CDM OpenAPI schema, used for validating transformed OMOP JSON Example: 'https://raw.githubusercontent.com/CanDIG/candig-api/refs/heads/mshadbolt/ingest-schema-update/schema.yml'

1. generate_validation_rules.py

Purpose

This loads MOHCCN_STANDARD_DEFINITION_XLSX_FILE_NAME and GENERATION_SETTINGS_VALIDATION_RULES_JSON_FILE_NAME and generates GENERATED_VALIDATION_RULES_JSON_FILE_NAME, which contains validation rules and permitted values for MOHCCN source JSON data, organized by schema and field, e.g.:

{
  "Sample Registration": {
    "program_id": {
      "dataType": "Text",
      "requiredAlways": true
    },
    "specimen_tissue_source": {
      "dataType": "Text",
      "requiredAlways": true,
      "validationType": "list",
      "validationInfo": [
        "Blood derived - peripheral blood",
        "Solid tissue",
        "Bone marrow",
        "Other"
      ]
    }
  }
}

Usage

python generate_validation_rules.py

Features

  • Maps Schema column values to first-level keys in the JSON structure
  • Maps MOHCCN Clinical Field values to second-level keys
  • Determines data types automatically (Text, JSON, Number, Integer)
  • Processes requirement specifications (Required, Optional, Conditional)
  • Extracts permitted values from Permissible Values column
  • Loads permitted values from other Excel tabs when referenced

2. generate_vocabulary.py

Purpose

Extracts all vocabulary terms from GENERATED_VALIDATION_RULES_JSON_FILE_NAME and generates GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME that contains vocabulary terms that requires mapping.

Usage

python generate_vocabulary.py

Features

  • Extracts all list-type validation rules from GENERATED_VALIDATION_RULES_JSON_FILE_NAME
  • Creates individual rows for each vocabulary term to be mapped
  • Includes columns for mapping to target vocabularies:
    • Schema
    • Field
    • Original Term
    • Preferred Target Classes
    • Preferred Target Domains
    • Preferred Target Vocabularies
  • GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME contains columns Preferred Target Classes, Preferred Target Domains, and Preferred Target Vocabularies that can be populated to help with mapping.
  • If GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME already exists, values for Preferred Target Classes, Preferred Target Domains, and Preferred Target Vocabularies are loaded and preserved.

3. process_vocabulary_mapping.py

Purpose

Loads vocabulary to be mapped from GENERATED_UNMAPPED_VOCAB_CSV_FILE_NAME and existing mappings from EXISTING_MAPPED_VOCAB_XLSX_FILE_NAME. Uses columns Preferred Target Classes, Preferred Target Domains, and Preferred Target Vocabularies to help map vocabulary using API endpoint VOCAB_SERVER_SEARCH_BY_TERM_URL. Outputs a CSV file and Excel file with format of GENERATED_MAPPED_VOCAB_FILE_NAME_PREFIX plus the current date.

Usage

python process_vocabulary_mapping.py

Features

  • Preserves existing mappings and notes from previous runs
  • Makes API requests to vocabulary endpoint VOCAB_SERVER_SEARCH_BY_TERM_URL for unmapped terms, e.g. https://techna-omop.uhndata.io/dhdp-vocab-search/index.php?term=ORIGINAL_TERM&limit=10&preferredVocabs=SNOMED,LOINC&preferredDomains=&preferredClasses
  • Mapped vocabulary is added into the following columns:
    • Mapped Term
    • Concept ID: Athena concept ID
    • Concept Code: Vocabulary-specific concept code
    • Term Type (Concept/Synonym)
    • Match Type (Exact*, Inexact*, etc.)
    • Class
    • Domain
    • Vocabulary
    • Mapped By: For new mappings, value is set to [Your Name] for easy search and replace
    • Approved By
    • Review URL: URL that links to Vocabulary Search GUI for reviewing mapping results
    • Notes
  • Generates timestamped output files to preserve version history

Match Type Logic

  • "Exact*" - Multiple exact matches found
  • "Inexact*" - Multiple inexact matches found and first result is inexact
  • "Exact" - Single exact match
  • "Inexact" - Single inexact match

4. generate_etl_rules.py

Purpose

Loads Excel MOHCCN_TO_OMOP_ETL_MODEL_XLSX_FILE_NAME and generates GENERATED_MOHCCN_TO_OMOP_ETL_SETTINGS_JSON_FILE_NAME into a JSON structure representing the transformation model from MOHCCN JSON to OMOP CDM JSON, e.g.:

{
  "JSON Level 1": {
    "Data Group 1": {
      "OMOP Table 1": [
        {
		"Target Field": "person_id",
		"Instruction": "Get Unique ID",
		"MOHCCN Source Schema": "Donor",
		"Source Field": "submitter_donor_id",
		"Value": "OMOP Patient Person",
		"Requirement Rule": "Skip Record"
        }
      ],
      "OMOP Table 2": [ /* more rules */ ]
    }
  }
}

Usage

python generate_etl_rules.py

Features

  • Organizes rules hierarchically by:
    • Source JSON Level
    • DHDP Data Group
    • OMOP Table

Transformation Instructions Supported

Fixed Value

Sets a fixed value for an OMOP column regardless of source data. MOHCCN Source Schema and Source Field columns must be set to 'N/A'.

Fixed Value If Not Empty

Sets a fixed value only if the source value is not empty. MOHCCN Source Schema and Source Field columns must be set to 'N/A'.

Copy Value

Copies a value directly from Value column. MOHCCN Source Schema and Source Field columns must be set to 'N/A'. Global Variables are identified by '{global_var_name}' and replaced accordingly.

Generate Unique ID

Generates a unique identifier map to unambigiously identify a generated PK value by a combination:

  • DHDP Data Group, which encapsulated both the target OMOP table as well as unique hierarchical context in source MOHCCN JSON, e.g. 'OMOP Primary Dx Initial Diagnosis Measurement'
  • Source schema and field, e.g. Primary Diagnosis.submitter_primary_diagnosis_id
Generate Unique ID From Path

Generates a unique ID based on a specific JSON path in the source data.

Get Unique ID

Retrieves a previously generated unique ID, potentially using global variables, where the Value column is the associated DHDP Data Group that identifies the hiearchical context.

Get Unique ID (No Global Vars)

Retrieves a unique ID without using global variables.

Get Unique ID From Path

Retrieves a unique ID from a specific JSON path.

Concept ID By Term

Looks up a concept ID from vocabulary mappings if source MOHCCN value is a textual term, e.g. Primary Diagnosis.basis_of_diagnosis.

Concept ID By Code

Looks up a concept ID from vocabulary API endpoint if source MOHCCN value is a code, e.g. Primary Diagnosis.cancer_type_code.

Calculate Year From Interval

Calculates a year value from a date interval JSON object.

Calculate Month From Interval

Calculates a month value from a date interval JSON object.

Calculate Day From Interval

Calculates a day value from a date interval JSON object.

Calculate Date From Interval

Calculates a full date from a date interval JSON object.

Concatenate Values

Concatenates multiple values using a separator (default: |).

Reset Global Vars

Resets global variables used for cross-record data sharing.

Save To Global Vars

Saves a value to global variables for use in subsequent transformations.

Vocabulary Server Setup

The ETL pipeline requires access to an OMOP Vocabulary Server, both to access the standard OMOP Athena vocabularies and to make modifications to the standard vocabulary.

We provide a set of docker compose files in vocabulary_server_setup for setting up the vocabulary server. See vocabulary_server_setup/readme for more details on setting up the server.

Testing

Add information about what tests are available for the software and how to run them

Contributing to MOHCCN-to-OMOP ETL Script Library

See our CONTRIBUTING page for details on how to contribute to this project.

Contributors

Thanks to the following people who have contributed to this project:

License

All CanDIG code is licensed with LGPL-3.0 license.

About

Tools to transform data in mohccn format to OMOP CDM

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors