This module provides tools for ingesting structured datasets from YAML data hierarchy files through the OpenGIN services. It processes hierarchical data structures (ministers, departments, categories, subcategories, and datasets) and creates corresponding entities and relationships in the OpenGIN system.
The ingestion system reads YAML data hierarchy files that describe the structure of datasets organized in a flexible hierarchy:
- Ministers → Categories → Subcategories → Datasets
- Ministers → Categories → Datasets
- Ministers → Departments → Categories → Subcategories → Datasets
- Ministers → Departments → Categories → Datasets
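A minimal manifest following the first shape might look like the sketch below; the key names are illustrative assumptions, since the actual schema is defined by the script's YAML parser:

```yaml
# Hypothetical manifest sketch for the Ministers → Categories → Subcategories → Datasets shape.
# Key names are assumptions, not the parser's actual schema.
ministers:
  - name: Minister of Health
    categories:
      - name: Public Health
        subcategories:
          - name: Immunization
            datasets:
              - statistics/immunization_coverage
```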
It also supports ingesting citizen profiles: personal datasets attached to individual citizen entities. Profiles are year-independent and are processed separately using the --profiles flag.
Each dataset is stored as a JSON file and is ingested as an attribute on the appropriate parent entity (category, subcategory, or citizen).
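For illustration, a dataset JSON file might hold tabular rows as sketched below; the exact schema is an assumption here (the real one is defined in models/schema.py):

```json
{
  "rows": [
    { "district": "Colombo", "value": "1234" },
    { "district": "Galle", "value": "987" }
  ]
}
```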
Before running the ingestion script, ensure you have completed the following setup steps:
Make sure the OpenGIN services are up and running. The ingestion script requires:
- Read Service: For querying existing entities and relationships
- Ingestion Service: For creating and updating entities
Restore the 0.0.1 data backup to ensure you have the base entities (ministers, departments, etc.) that the ingestion script will reference and build upon.
Create a virtual environment and install the required dependencies.
Option 1: Using Mamba or Conda (Recommended)
# Create the environment from environment.yml
mamba env create -f environment.yml
# Or using Conda:
# conda env create -f environment.yml
# Activate the environment
mamba activate datasets_env
# Or using Conda:
# conda activate datasets_env

Option 2: Using Python venv
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

The ingestion script requires the following environment variables:
- READ_BASE_URL: Base URL for the OpenGin Read Service
- INGESTION_BASE_URL: Base URL for the OpenGin Ingestion Service
You can set these in your environment or create a .env file in the ingestion/ directory:
export READ_BASE_URL="http://localhost:8081"
export INGESTION_BASE_URL="http://localhost:8080"

Or create a .env file:
READ_BASE_URL=http://localhost:8081
INGESTION_BASE_URL=http://localhost:8080
If using a .env file, make sure you have python-dotenv installed (included in requirements.txt).
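As a sketch, loading these settings might look like the following; the try/except fallback and the default URLs are assumptions mirroring the examples above:

```python
import os

try:
    # python-dotenv is listed in requirements.txt; a missing install simply
    # falls back to variables already exported in the shell.
    from dotenv import load_dotenv
    load_dotenv()  # reads a .env file from the current working directory
except ImportError:
    pass

# Defaults here are assumptions mirroring the examples above.
READ_BASE_URL = os.getenv("READ_BASE_URL", "http://localhost:8081")
INGESTION_BASE_URL = os.getenv("INGESTION_BASE_URL", "http://localhost:8080")
```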
Once all prerequisites are met, you can run the ingestion script:
# From the project root directory
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml
# Or with an explicit year override
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml --year 2020
# Ingest citizen profiles (no year required)
python -m ingestion.ingest_data_yaml data/people/profiles_hierarchy.yaml --profiles

| Argument | Type | Description |
|---|---|---|
| yaml_file | required | Path to the YAML manifest file |
| --year | optional | Override the year extracted from the filename |
| --profiles | optional flag | Process a profiles YAML; skips year extraction and processes only citizen entries |
Used for ingesting statistical datasets organised by ministers, departments, and categories. A year is required (either extracted from the filename or provided via --year).
# Ingest 2020 data
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml
# Ingest 2021 data
python -m ingestion.ingest_data_yaml data/statistics/2021/data_hierarchy_2021.yaml

In this mode the script will:
- Extract the year from the filename (or use --year)
- Process governments, ministers, departments, categories, and their datasets
Used for ingesting citizen profile datasets. Profiles are personal data records (e.g. name, political party, date of birth) attached to existing citizen entities. They are not year-specific.
# Ingest citizen profiles
python -m ingestion.ingest_data_yaml data/people/profiles_hierarchy.yaml --profiles

In this mode the script will:
- Skip year extraction entirely
- Skip government and minister processing
- Only process citizen entries and attach their profile datasets as attributes
Each citizen entry in the profiles YAML can be identified by name or id. If both are provided, the id takes priority.
citizen:
  # By name: looks up the citizen entity by name
  - name: Harini Amarasuriya
    profile:
      - profiles/Harini Amarasuriya

  # By id: looks up the citizen entity by ID, resolves the name from the returned entity
  - id: 2151-38_cit_14
    profile:
      - profiles/Anura Karunathilaka

  # Both: id takes priority, name is ignored for lookup
  - id: 2151-38_cit_14
    name: Anura Karunathilaka
    profile:
      - profiles/Anura Karunathilaka

Note
The citizen entity must already exist in the system. If it cannot be found by the given name or id, the entry is skipped with an error log. When looking up by id, the entity must be of kind Person/citizen; entries of any other kind are also skipped.
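The lookup-priority rules above can be sketched as a small helper; the function and field names here are assumptions, not the script's actual code:

```python
def resolve_citizen(entry, lookup_by_id, lookup_by_name):
    """Resolve a citizen entry from the profiles YAML (illustrative helper).

    id takes priority over name; entities found by id must be of kind
    Person/citizen. Returns None when the entry should be skipped.
    """
    if entry.get("id"):
        entity = lookup_by_id(entry["id"])
        if entity is None or entity.get("kind") != "Person/citizen":
            return None  # not found, or wrong kind: skipped with an error log
        return entity
    if entry.get("name"):
        return lookup_by_name(entry["name"])  # may also return None → skipped
    return None
```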
- Parse YAML Manifest: Reads the YAML file to extract the hierarchical structure
- Find Entities: Uses the Read Service to find existing ministers and departments by name and year
- Create Categories: Creates category and subcategory entities as needed
- Process Datasets: Reads dataset JSON files and adds them as attributes to parent entities
- Create Relationships: Establishes relationships between entities (e.g., AS_CATEGORY)
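The steps above can be sketched roughly as follows; the callables stand in for the Read and Ingestion service clients, and all field names are illustrative assumptions rather than the script's real API:

```python
import json

def ingest_hierarchy(manifest, year, find_entity, ensure_category, link, add_attribute):
    """Illustrative walk of the statistics-mode flow (not the actual script)."""
    for minister in manifest.get("ministers", []):
        minister_id = find_entity(minister["name"], year)   # step 2: look up by name and year
        if minister_id is None:
            continue                                        # unresolved ministers are skipped
        for category in minister.get("categories", []):
            cat_id = ensure_category(category["name"])      # step 3: create if absent
            link(minister_id, cat_id, "AS_CATEGORY")        # step 5: relationship
            for path in category.get("datasets", []):
                with open(path) as f:                       # step 4: dataset JSON becomes an attribute
                    add_attribute(cat_id, json.load(f))
```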
- Parse YAML Manifest: Reads the profiles YAML file to extract citizen entries
- Find Citizens: Uses the Read Service to look up existing citizen entities, by id if provided, otherwise by name
- Process Profiles: Reads each citizen's profile data.json and attaches it as an attribute on the citizen entity using the citizen's own start/end time period
ingestion/
├── ingest_data_yaml.py # Main ingestion script
├── .env # Environment variables
├── exception/ # Exception handling
│ └── exceptions.py
├── models/ # Data models and schemas
│ └── schema.py
├── services/ # Service layer
│ ├── entity_resolver.py # Entity lookup and resolution
│ ├── ingestion_service.py # OpenGin Ingestion API client
│ ├── read_service.py # OpenGin Read API client
│ └── yaml_parser.py # YAML parsing utilities
├── utils/ # Utility functions
│ ├── date_utils.py # Date/time calculations
│ ├── http_client.py # HTTP client for API calls
│ └── util_functions.py # General utilities
└── requirements.txt # Python dependencies
Make sure you're running the command from the project root directory, not from within the ingestion/ folder.
If you encounter import errors, ensure all dependencies are installed:
pip install -r requirements.txt

The script will exit with an error if READ_BASE_URL or INGESTION_BASE_URL is not set. Make sure these are configured before running.
If you see connection errors, verify that:
- OpenGin services are running
- The base URLs in your environment variables are correct
- Your network/firewall allows connections to these services
- The script processes ministers and citizens sequentially
- Datasets are validated before ingestion; null values in rows are automatically converted to empty strings
- The script handles time period calculations for attributes based on parent entity time ranges and dataset years
- Categories and subcategories are checked for existence before creation to avoid duplicates
- In profiles mode, citizen entities must already exist in the system — if a citizen is not found, it is skipped with an error log
- If multiple citizen entities are found with the same name, a warning is logged and the first result is used
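Two of these behaviours, the null-value conversion and the duplicate-name rule, can be sketched as small helpers (illustrative only; names are not the script's actual API):

```python
import logging

def normalize_rows(rows):
    """Replace null (None) cell values with empty strings before ingestion."""
    return [
        {key: ("" if value is None else value) for key, value in row.items()}
        for row in rows
    ]

def pick_citizen(results, name):
    """Apply the duplicate-name rule: warn and use the first match."""
    if not results:
        logging.error("Citizen %r not found; skipping entry", name)
        return None
    if len(results) > 1:
        logging.warning("Multiple citizens named %r; using the first match", name)
    return results[0]
```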